This belief has led us to develop Privacy Aware Infrastructure (PAI), which offers efficient and reliable first-class privacy constructs embedded in Meta infrastructure to address different privacy requirements, such as purpose limitation, which restricts the purposes for which data can be processed and used, across the many languages in Meta's stack (Hack, C++, Python, etc.).
Performance is one of the key deciding criteria, if not the most important one, in choosing a cloud data warehouse service. In today’s fast-changing world, enterprises have to make data-driven decisions quickly, and for that they rely heavily on their data warehouse service. Cloudera Data Warehouse vs. HDInsight.
In a previous blog post on CDW performance, we compared Azure HDInsight to CDW. In this blog post, we compare Cloudera Data Warehouse (CDW) on Cloudera Data Platform (CDP) using Apache Hive-LLAP to EMR 6.0 (also powered by Apache Hive-LLAP) on Amazon, using the TPC-DS 2.9 benchmark. More on this later in the blog.
Data powers Uber’s global marketplace, enabling more reliable and seamless user experiences across our products for riders, … The post Databook: Turning Big Data into Knowledge with Metadata at Uber appeared first on Uber Engineering Blog.
Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. These patterns include both centralized storage patterns like the data warehouse, data lake, and data lakehouse, and distributed patterns such as data mesh.
To ensure comprehensive protection, it is essential to apply the necessary steps to all systems that store or process data, including distributed systems (web systems, chat, mobile, and backend services) and data warehouses. Consider the data flow from online systems to the data warehouse, as shown in the diagram below.
Cloud data warehouses allow users to run analytic workloads with greater agility, better isolation and scale, and lower administrative overhead than ever before. The results demonstrate the superior price-performance of Cloudera Data Warehouse on the full set of 99 queries from the TPC-DS benchmark. Introduction.
By Anupom Syam. Background: At Netflix, our current data warehouse contains hundreds of petabytes of data stored in AWS S3, and each day we ingest and create additional petabytes. Some of the optimizations are prerequisites for a high-performance data warehouse.
The blog highlights how moving from 6-character base-64 to 20-digit base-2 file distribution spreads load more evenly across S3 and reduces request failures. Jon Osborn: Best Practices for Using QUERY_TAG in Snowflake. Modern data warehouses are good at running at scale, provided cost is not a constraint.
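As a hedged illustration of the QUERY_TAG practice (not code from the linked post; the connection parameters and tag fields below are assumptions), a pipeline can tag its Snowflake session so queries group cleanly in the query history:

```python
# Illustrative sketch: tag Snowflake queries so cost and performance can be
# attributed per pipeline. Assumes snowflake-connector-python and valid creds.
import json
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # hypothetical connection parameters
    user="etl_user",
    password="...",
    warehouse="ETL_WH",
)
cur = conn.cursor()

# QUERY_TAG is a session parameter; a structured (e.g. JSON) tag makes it
# easy to group queries in ACCOUNT_USAGE.QUERY_HISTORY later.
tag = json.dumps({"team": "analytics", "pipeline": "daily_orders", "env": "prod"})
cur.execute(f"ALTER SESSION SET QUERY_TAG = '{tag}'")

# Every subsequent query in this session carries the tag.
cur.execute("SELECT COUNT(*) FROM orders")
print(cur.fetchone())
```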
Making a decision on a cloud data warehouse is a big deal. Modernizing your data warehousing experience with the cloud means moving from dedicated, on-premises hardware focused on traditional relational analytics on structured data to a modern platform.
Snowflake was founded in 2012 around its data warehouse product, which is still its core offering; Databricks was founded in 2013 by Spark’s co-creators out of academia, with Spark becoming an Apache project in 2014. The table format adds metadata, read, write, and transaction support that let you treat a Parquet file as a table.
In this blog, we will share with you in detail how Cloudera integrates core compute engines, including Apache Hive and Apache Impala, in Cloudera Data Warehouse with Iceberg. We will publish follow-up blogs for other data services. Iceberg basics: Iceberg is an open table format designed for large analytic workloads.
Some of the most powerful results come from combining complementary superpowers, and the “dynamic duo” of Apache Hive LLAP and Apache Impala, both included in Cloudera Data Warehouse, is further evidence of this. Both Impala and Hive can operate at an unprecedented and massive scale, with many petabytes of data.
This blog post expands on that insightful conversation, offering a critical look at Iceberg's potential and the hurdles organizations face when adopting it. This ecosystem includes:
Catalogs: services that manage metadata about Iceberg tables (e.g., …)
Compute engines: tools that query and process data stored in Iceberg tables (e.g., …)
Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms. In this blog, we will discuss: What is the Open Table Format (OTF)? First, we create an Iceberg table in Snowflake and then insert some data.
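A minimal sketch of that first step, assuming snowflake-connector-python and a pre-created external volume (the volume, table, and column names are placeholders, not from the post):

```python
# Sketch: create a Snowflake-managed Iceberg table, then insert rows.
# Data and metadata land on the external volume in open Iceberg/Parquet
# formats, readable by other engines.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="etl_user", password="...")
cur = conn.cursor()

cur.execute("""
    CREATE ICEBERG TABLE customer_events (
        event_id   BIGINT,
        user_id    BIGINT,
        event_type STRING,
        event_ts   TIMESTAMP_NTZ
    )
    CATALOG = 'SNOWFLAKE'
    EXTERNAL_VOLUME = 'iceberg_vol'   -- hypothetical volume name
    BASE_LOCATION = 'customer_events/'
""")

cur.execute("""
    INSERT INTO customer_events VALUES
        (1, 100, 'page_view', '2024-01-01 10:00:00'),
        (2, 101, 'purchase',  '2024-01-01 10:05:00')
""")
```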
Today’s customers have a growing need for faster end-to-end data ingestion to meet the expected speed of insights and overall business demand. This ‘need for speed’ drives a rethink of how to build a more modern data warehouse solution, one that balances speed with platform cost management, performance, and reliability.
The fact that ETL tools evolved to expose graphical interfaces seems like a detour in the history of data processing, and would certainly make for an interesting blog post of its own. Sure, there’s a need to abstract the complexity of data processing, computation and storage.
Consensus seeking: Whether you think that old-school data warehousing concepts are fading or not, the quest to achieve conformed dimensions and conformed metrics is as relevant as it ever was. The data warehouse needs to reflect the business, and the business should have clarity on how it thinks about analytics.
How self-service data warehousing frees IT resources. Cloudera Data Warehouse (CDW) is a cloud service and an integral part of the newly released Cloudera Data Platform (CDP). Key features are: highly scalable and performant open-source engines for BI and data warehousing workloads, and simplified provisioning.
Summary: The majority of blog posts and presentations about data engineering and analytics assume that the consumers of those efforts are internal business users accessing an environment controlled by the business. Atlan is the metadata hub for your data ecosystem.
Note that where a TRUNCATE PARTITION is typically a “free” metadata operation, a DELETE operation may be expensive, and that should be taken into consideration. This means that ideally the logic in source control describes how to build the full state of the data warehouse throughout all time periods.
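A minimal sketch of that idempotent pattern in PySpark (the table, column, and partition names are assumptions): each run recomputes and overwrites exactly one partition, so replaying a period rebuilds its state instead of issuing row-level DELETEs.

```python
# Sketch: idempotent, partition-scoped load. Re-running a day rebuilds
# exactly that day's partition.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idempotent-load").getOrCreate()

def load_day(ds: str) -> None:
    """Recompute and overwrite a single date partition."""
    daily = spark.sql(f"""
        SELECT user_id, SUM(amount) AS total_amount, DATE'{ds}' AS ds
        FROM raw.orders
        WHERE order_date = DATE'{ds}'
        GROUP BY user_id
    """)
    # Dynamic partition overwrite replaces only the affected partition,
    # a cheap metadata swap compared to a row-level DELETE.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    daily.write.mode("overwrite").insertInto("warehouse.daily_user_spend")

load_day("2024-01-01")
```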
The most commonly used one is dataflow project, which helps folks manage their data pipeline repositories through creation, testing, deployment, and a few other activities. It lets you create YAML-formatted mock data files based on selected tables, columns, and a few rows of data from the Netflix data warehouse.
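The sketch below is not the Netflix tool itself, just the idea behind its mock-data feature: sample a few rows from a warehouse table and freeze them as YAML that unit tests can load instead of hitting the real warehouse (assumes PyYAML and a DB-API connection; all names are illustrative).

```python
# Sketch: freeze a small table sample as YAML mock data for tests.
import yaml

def save_mock(conn, table: str, columns: list[str], out_path: str, n: int = 5) -> None:
    cur = conn.cursor()
    cur.execute(f"SELECT {', '.join(columns)} FROM {table} LIMIT {n}")
    rows = [dict(zip(columns, row)) for row in cur.fetchall()]
    with open(out_path, "w") as f:
        yaml.safe_dump({"table": table, "rows": rows}, f, sort_keys=False)

# Hypothetical usage:
# save_mock(conn, "core.playback_sessions", ["user_id", "title_id"], "mocks/playback.yaml")
```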
This is part 2 in this blog series. You can read part 1 here: Digital Transformation is a Data Journey From Edge to Insight. The first blog introduced a mock connected vehicle manufacturing company, The Electric Car Company (ECC), to illustrate the manufacturing data path through the data lifecycle.
Data volume and velocity, governance, structure, and regulatory requirements have all evolved and continue to. Despite these limitations, data warehouses, introduced in the late 1980s based on ideas developed even earlier, remain in widespread use today for certain business intelligence and data analysis applications.
We are proud to announce the general availability of Cloudera Altus Data Warehouse, the only cloud data warehousing service that brings the warehouse to the data. Modern data warehousing for the cloud. Using Cloudera Altus for your cloud data warehouse.
Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. RudderStack’s smart customer data pipeline is warehouse-first.
Our investments in a lakeless data warehouse, a modern analytics platform, and strong master data practices have made data a core strategic capability. This blog reflects on key milestones, cultural shifts, and personal growth along the way. Our journey is well-documented in Picnic tech blog posts.
At TCS, we help companies shift their enterprise data warehouse (EDW) platforms to the cloud, in addition to offering IT services. We’re extremely familiar with just how tricky a cloud migration can be, especially when it involves moving historical business data. Use separate data warehouses for cost-effective data loading.
The blog highlights the advantages of GNNs over traditional machine learning models, which struggle to discern relationships between various entities, such as users and restaurants, and edges, such as orders. The author highlights Paimon’s consistency model by examining its metadata model.
Overview: This blog post describes support for materialized views for the Iceberg table format. Iceberg brings the reliability and simplicity of SQL tables to big data while enabling engines like Hive, Impala, Spark, Trino, Flink, and Presto to work with the same tables at the same time. Starting from the CDW Public Cloud DWX-1.6.1
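The post's exact DDL isn't reproduced here; the following is a hedged sketch of the standard Hive materialized-view commands the feature builds on, issued over PyHive (the host, table, and column names are placeholders):

```python
# Sketch: define a Hive materialized view so the optimizer can rewrite
# matching queries to read the precomputed result. Assumes PyHive and an
# Iceberg source table.
from pyhive import hive

cur = hive.connect("llap-host").cursor()  # hypothetical HiveServer2 host

cur.execute("""
    CREATE MATERIALIZED VIEW store_sales_daily AS
    SELECT sold_date, store_id, SUM(net_paid) AS revenue
    FROM store_sales            -- an Iceberg table
    GROUP BY sold_date, store_id
""")

# After new data lands in the base table, refresh the view (incrementally
# where possible, fully otherwise).
cur.execute("ALTER MATERIALIZED VIEW store_sales_daily REBUILD")
```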
Cloudera and Accenture demonstrate the strength of their relationship with an accelerator called the Smart Data Transition Toolkit, for migration of legacy data warehouses into Cloudera Data Platform. Accenture’s Smart Data Transition Toolkit. Are you looking for your data warehouse to support the hybrid multi-cloud?
Since the value of data quickly drops over time, organizations need a way to analyze data as it is generated. To avoid disruptions to operational databases, companies typically replicate data to data warehouses for analysis.
Data engineering inherits from years of data practices at big US companies. Hadoop initially led the way with Big Data and distributed computing on-premises, before the field finally landed on the Modern Data Stack — in the cloud — with a data warehouse at the center. My advice on this point is to learn from others.
Your sunk costs are minimal, and if a workload or project you are supporting becomes irrelevant, you can quickly spin down your cloud data warehouses and not be “stuck” with unused infrastructure. Cloud deployments for suitable workloads give you the agility to keep pace with rapidly changing business and data needs.
However, for all of our uncertified data, which remained the majority of our offline data, we lacked visibility into its quality and didn’t have clear mechanisms for up-leveling it. How could we scale the hard-fought wins and best practices of Midas across our entire data warehouse?
While cloud-native, point-solution data warehouse services may serve your immediate business needs, there are dangers to the corporation as a whole when you do your own IT this way. Cloudera Data Warehouse (CDW) is here to save the day! CDW is an integrated data warehouse service within Cloudera Data Platform (CDP).
These lakes power mission-critical, large-scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. In recent years, the term “data lakehouse” was coined to describe this architectural pattern of tabular analytics over data in the data lake.
Many organizations struggle to meet growing and variable data warehouse demands. This is exactly what Cloudera Data Platform (CDP) provides to Cloudera Data Warehouse. CDP is a data platform that is optimized for both business units and central IT. Cloudera Data Warehouse Security.
Analyst Sumit Pal, in “Exploring Lakehouse Architecture and Use Cases,” published January 11, 2022: “Data lakehouses integrate and unify the capabilities of data warehouses and data lakes, aiming to support AI, BI, ML, and data engineering on a single platform.” Iceberg handles massive data born in the cloud.
In this post we will define data quality at a high level and explore our motivation to achieve better data quality. We will then introduce our in-house product, Verity, and showcase how it serves as a central platform for ensuring data quality in our Hive data warehouse. What and Where is Data Quality?
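Verity's actual check definitions aren't shown in this excerpt; the sketch below is a generic stand-in for what such a platform evaluates, a metric query plus a pass condition (all names are illustrative, and `conn` is any DB-API connection):

```python
# Sketch: a data quality check = a metric query + a pass condition.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    query: str                       # must return a single numeric value
    passes: Callable[[float], bool]  # condition applied to that value

null_user_ids = Check(
    name="orders_null_user_id",
    query="SELECT COUNT(*) FROM core.orders WHERE user_id IS NULL",
    passes=lambda v: v == 0,
)

def run_check(conn, check: Check) -> bool:
    cur = conn.cursor()
    cur.execute(check.query)
    value = cur.fetchone()[0]
    ok = check.passes(value)
    print(f"{check.name}: value={value} passed={ok}")
    return ok
```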
Cloudera Data Platform (CDP) scored among the top 10 vendors on all four analytical use cases (Data Warehouse, Logical Data Warehouse, Data Lake, and Operational Intelligence) in the Critical Capabilities for Cloud Database Management Systems for Analytics Use Cases.
In the second part, we will focus on architectural patterns to implement data quality from a data contract perspective. Why is Data Quality Expensive? I won’t bore you with the importance of data quality in this blog. Event routers can add additional metadata to the envelope of the event. How to Fix It?
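As a minimal sketch of that envelope idea (the field names are illustrative, not from the post), a router can stamp its own metadata around the producer's payload without touching the payload itself:

```python
# Sketch: wrap a payload in an envelope carrying router-added metadata
# (ids, timestamps, schema version) that consumers can use to enforce
# data contracts.
import uuid
from datetime import datetime, timezone

def wrap_event(payload: dict, source: str, schema_version: str) -> dict:
    return {
        "envelope": {
            "event_id": str(uuid.uuid4()),
            "emitted_at": datetime.now(timezone.utc).isoformat(),
            "source": source,
            "schema_version": schema_version,  # lets consumers validate the contract
        },
        "payload": payload,
    }

event = wrap_event({"order_id": 42, "amount": 19.99}, source="checkout", schema_version="1.2")
```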
With in-place table migration, you can rapidly convert to Iceberg tables, since there is no need to regenerate data files; only the metadata is regenerated, and the newly generated metadata then points to the source data files, as illustrated in the diagram below. Data quality using table rollback. Metadata management.
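As a sketch of what in-place migration and rollback look like through Iceberg's Spark procedures (the catalog, table name, and snapshot id are placeholders; assumes a Spark session with the Iceberg runtime and SQL extensions configured):

```python
# Sketch: convert an existing Hive table to Iceberg in place, then roll
# back a snapshot. migrate() rewrites only metadata; existing Parquet/ORC
# data files are referenced, not copied.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-migrate")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

# In-place migration: new Iceberg metadata points at the old data files.
spark.sql("CALL spark_catalog.system.migrate('db.sales')")

# Table rollback for data quality incidents: restore a previous snapshot.
spark.sql("CALL spark_catalog.system.rollback_to_snapshot('db.sales', 123456789)")
```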
So, embrace the power of Change Data Capture, and embark on a captivating journey where the magic of real-time data awaits. In this blog, we will cover: What Is CDC and Its Benefits? CDC also plays a crucial role in data integration and ETL processes. Where Is CDC Used and Who Uses It?
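For a concrete feel of CDC, the sketch below applies Debezium-style change events, a widely used CDC format, to keep a downstream copy in sync; the in-memory "replica" dict is a simplified stand-in for a warehouse table:

```python
# Sketch: apply CDC change events (before/after images + operation code)
# to a downstream replica.
import json

def apply_change(event: dict, target: dict) -> None:
    op = event["op"]                 # "c"=create, "u"=update, "d"=delete
    if op in ("c", "u"):
        row = event["after"]
        target[row["id"]] = row      # upsert into the replica
    elif op == "d":
        target.pop(event["before"]["id"], None)

replica: dict = {}
apply_change(json.loads('{"op": "c", "before": null, "after": {"id": 1, "status": "new"}}'), replica)
apply_change(json.loads('{"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}}'), replica)
print(replica)  # {1: {'id': 1, 'status': 'paid'}}
```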