Data Warehouse and Metadata - Data Engineering Digest

How to get started with dbt

Christophe Blefari

MARCH 1, 2023

dbt Core is an open-source framework that helps you organise data warehouse SQL transformations. dbt was born out of the observation that more and more companies were switching from on-premise Hadoop data infrastructure to cloud data warehouses. This switch has been led by the modern data stack vision.

Data Warehouse

Data Warehouse SQL Metadata Raw Data

How Apache Iceberg Is Changing the Face of Data Lakes

Snowflake

APRIL 2, 2025

Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.

Data Lake

Data Lake Metadata Cloud Storage Data Warehouse

Cloudera Data Warehouse outperforms Azure HDInsight in TPC-DS benchmark

Cloudera

SEPTEMBER 29, 2020

Performance is a key deciding criterion, if not the most important one, in choosing a Cloud Data Warehouse service. In today’s fast-changing world, enterprises have to make data-driven decisions quickly, and for that they rely heavily on their data warehouse service. Cloudera Data Warehouse vs HDInsight.

Data Warehouse

Data Warehouse Cloud Storage Metadata Cloud

How Meta discovers data flows via lineage at scale

Engineering at Meta

JANUARY 22, 2025

These stages propagate through various systems including function-based systems that load, process, and propagate data through stacks of function calls in different programming languages (e.g., Hack, C++, Python, etc.). For simplicity, we will demonstrate these for the web, the data warehouse, and AI, per the diagram below.

Data Warehouse

Data Warehouse SQL Programming Language Data

Data News — Week 24.11

Christophe Blefari

MARCH 15, 2024

Attributing Snowflake cost to whom it belongs — Fernando gives ideas about metadata management to better attribute Snowflake cost. This is Croissant. Starting today it will be supported by 3 major platforms: Kaggle, HuggingFace and OpenML.

Metadata

Metadata Data Datasets Data Warehouse

Eliminate Friction In Your Data Platform Through Unified Metadata Using OpenMetadata

Data Engineering Podcast

NOVEMBER 10, 2021

Summary A significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata across various tools. Start trusting your data with Monte Carlo today! Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads?

Metadata

Metadata Data Warehouse Data Lake BI

3x better performance with CDP Data Warehouse compared to EMR in TPC-DS benchmark

Cloudera

DECEMBER 11, 2020

In this blog post, we compare Cloudera Data Warehouse (CDW) on Cloudera Data Platform (CDP) using Apache Hive-LLAP to EMR 6.0 (also powered by Apache Hive-LLAP) on Amazon using the TPC-DS 2.9 benchmark. Cloudera Data Warehouse vs EMR. Learn more about Cloudera Data Warehouse on CDP.

Data Warehouse

Data Warehouse Metadata Datasets Machine Learning

Announcing New Innovations for Data Warehouse, Data Lake, and Data Lakehouse in the Data Cloud

Snowflake

NOVEMBER 2, 2023

Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. These patterns include both centralized storage patterns like data warehouse, data lake and data lakehouse, and distributed patterns such as data mesh.

Data Lake

Data Lake Data Warehouse Cloud Unstructured Data

Databook: Turning Big Data into Knowledge with Metadata at Uber

Uber Engineering

AUGUST 3, 2018

Data powers Uber’s global marketplace, enabling more reliable and seamless user experiences across our products for riders, …

Metadata

Metadata Big Data Transportation Data

Optimizing data warehouse storage

Netflix Tech

DECEMBER 21, 2020

By Anupom Syam. Background: At Netflix, our current data warehouse contains hundreds of Petabytes of data stored in AWS S3, and each day we ingest and create additional Petabytes. Some of the optimizations are prerequisites for a high-performance data warehouse.

Data Warehouse

Data Warehouse Metadata Algorithm Data

Keeping Your Data Warehouse In Order With DataForm

Data Engineering Podcast

OCTOBER 14, 2019

Summary Managing a data warehouse can be challenging, especially when trying to maintain a common set of patterns. What are some of the challenges and mistakes that are common among engineers and analysts with regard to versioning and evolving schemas and the accompanying data?

Data Warehouse

Data Warehouse PostgreSQL AWS Programming Language

Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At Acryl Data

Data Engineering Podcast

OCTOBER 15, 2021

Summary The binding element of all data work is the metadata graph that is generated by all of the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to bring order to the scale of LinkedIn’s data needs. No more scripts, just SQL.

Metadata

Metadata BI Data Warehouse Government

Cloudera Data Warehouse Demonstrates Best-in-Class Cloud-Native Price-Performance

Cloudera

JANUARY 15, 2021

Cloud data warehouses allow users to run analytic workloads with greater agility, better isolation and scale, and lower administrative overhead than ever before. The results demonstrate superior price performance of Cloudera Data Warehouse on the full set of 99 queries from the TPC-DS benchmark. Introduction.

Data Warehouse

Data Warehouse Cloud Consulting SQL

Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores

Netflix Tech

OCTOBER 27, 2020

Usually, data scientists and engineers write Extract-Transform-Load (ETL) jobs and pipelines using big data compute technologies, like Spark or Presto, to process this data and periodically compute key information for a member or a video. The processed data is typically stored as data warehouse tables in AWS S3.

Data Warehouse

Data Warehouse Datasets Data Big Data

Key considerations when making a decision on a Cloud Data Warehouse

Cloudera

MAY 17, 2021

Making a decision on a cloud data warehouse is a big deal. Modernizing your data warehousing experience with the cloud means moving from dedicated, on-premises hardware focused on traditional relational analytics on structured data to a modern platform.

Data Warehouse

Data Warehouse Cloud Government Metadata

Databricks, Snowflake and the future

Christophe Blefari

JUNE 21, 2024

Snowflake was founded in 2012 around its data warehouse product, which is still its core offering, and Databricks was founded in 2013 by researchers from academia who co-created Spark, which became Apache Spark in 2014. It adds metadata, read, write and transactions that allow you to treat a Parquet file as a table.

Metadata

Metadata Data Warehouse BI MySQL

Reflecting On The Past 6 Years Of Data Engineering

Data Engineering Podcast

FEBRUARY 5, 2023

Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses. Go to dataengineeringpodcast.com/materialize. Support Data Engineering Podcast

Data Engineering

Data Engineering Data Engineer Engineering PostgreSQL

Open Data Lakehouse powered by Iceberg for all your Data Warehouse needs

Cloudera

APRIL 3, 2023

In this blog, we will share with you in detail how Cloudera integrates core compute engines including Apache Hive and Apache Impala in Cloudera Data Warehouse with Iceberg. We will publish follow-up blogs for other data services. Try Cloudera Data Warehouse (CDW) by signing up for a 60-day trial, or test drive CDP.

Data Warehouse

Data Warehouse Java Metadata Data

AI and Data Predictions 2025: Strategies to Realize the Promise of AI

Snowflake

DECEMBER 4, 2024

The trend to centralize data will accelerate, making sure that data is high-quality, accurate and well managed. Overall, data must be easily accessible to AI systems, with clear metadata management and a focus on relevance and timeliness.

Unstructured Data

Unstructured Data Data Lake Deep Learning Structured Data

Data logs: The latest evolution in Meta’s access tools

Engineering at Meta

FEBRUARY 4, 2025

Meta joins the Data Transfer Project and has continuously led the development of shared technologies that enable users to port their data from one platform to another. 2024: Users can access data logs in Download Your Information. What are data logs?

Accessible

Accessible Accessibility Raw Data Data Warehouse

Data Engineering Weekly #198

Data Engineering Weekly

NOVEMBER 24, 2024

Jon Osborn: Best Practices for Using QUERY_TAG in Snowflake. Modern data warehouses are good at running at scale, provided cost is not a constraint. The service offers configurable counter types optimized for various use cases with a unified Control Plane configuration. I’ve seen a similar work by Ben E.

Data Engineering

Data Engineering Data Engineer Engineering Insurance

A Cost-Effective Data Warehouse Solution in CDP Public Cloud – Part1

Cloudera

FEBRUARY 9, 2021

Today’s customers have a growing need for a faster end to end data ingestion to meet the expected speed of insights and overall business demand. This ‘need for speed’ drives a rethink on building a more modern data warehouse solution, one that balances speed with platform cost management, performance, and reliability.

Data Warehouse

Data Warehouse Cloud Kafka Cloud Storage

The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse

Data Engineering Podcast

FEBRUARY 19, 2023

Summary Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data they constrain the possibilities of what data you can store and how it can be used.

IT

IT Data Lake Metadata Data Warehouse

Collecting And Retaining Contextual Metadata For Powerful And Effective Data Discovery

Data Engineering Podcast

AUGUST 13, 2022

Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day.

Metadata

Metadata MongoDB MySQL Scala

Is Apache Iceberg the New Hadoop? Navigating the Complexities of Modern Data Lakehouses

Data Engineering Weekly

MARCH 5, 2025

This ecosystem includes: Catalogs: Services that manage metadata about Iceberg tables. Compute Engines: Tools that query and process data stored in Iceberg tables (e.g., Trino, Spark, Snowflake, DuckDB). Maintenance Processes: Operations that optimize Iceberg tables, such as compacting small files and managing metadata.

Hadoop

Hadoop Metadata Data Ingestion Data Governance

Why Open Table Format Architecture is Essential for Modern Data Systems

phData: Data Engineering

NOVEMBER 8, 2024

First, we create an Iceberg table in Snowflake and then insert some data. Then, we add another column called HASHKEY, add more data, and locate the S3 file containing metadata for the Iceberg table. In the screenshot below, we can see that the metadata file for the Iceberg table retains the snapshot history.

Architecture

Architecture Systems Data Lake Google Cloud

The Rise of the Data Engineer

Maxime Beauchemin

JANUARY 20, 2017

Data modeling is changing Typical data modeling techniques — like the star schema — which defined our approach to data modeling for the analytics workloads typically associated with data warehouses, are less relevant than they once were.

Data Engineering

Data Engineering Data Engineer Engineering ETL Tools

Making Sense Of The Technical And Organizational Considerations Of Data Contracts

Data Engineering Podcast

DECEMBER 18, 2022

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Missing data? Stale dashboards?

Metadata

Metadata Business Intelligence Data Lake BI

Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI

Data Engineering Podcast

DECEMBER 29, 2022

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Missing data? Stale dashboards?

Management

Management Metadata Business Intelligence Data Lake

Toward a Data Mesh (part 2) : Architecture & Technologies

François Nguyen

MARCH 22, 2021

TL;DR: After setting up and organizing the teams, we describe 4 topics to make data mesh a reality. With this 3rd platform generation, you have more real-time data analytics and a cost reduction, because it is easier to manage this infrastructure in the cloud thanks to managed services. What you have to code is this workflow!

Technology

Technology Architecture Google Cloud Metadata

The High Cost of Poor Data Warehouse Governance

Monte Carlo

SEPTEMBER 10, 2024

This truth was hammered home recently when ride-hailing giant Uber found itself on the receiving end of a staggering €290 million ($324 million) fine from the Dutch Data Protection Authority. The reason? Poor data warehouse governance practices that led to the improper handling of sensitive European driver data.

Data Warehouse

Data Warehouse Government Data Governance Metadata

Data Lake vs. Data Warehouse vs. Data Lakehouse

Sync Computing

NOVEMBER 7, 2024

Data volume and velocity, governance, structure, and regulatory requirements have all evolved and continue to. Despite these limitations, data warehouses, introduced in the late 1980s based on ideas developed even earlier, remain in widespread use today for certain business intelligence and data analysis applications.

Data Lake

Data Lake Data Warehouse Business Intelligence Unstructured Data

Combining The Simplicity Of Spreadsheets With The Power Of Modern Data Infrastructure At Canvas

Data Engineering Podcast

JUNE 19, 2022

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code.

Metadata

Metadata Unstructured Data MongoDB MySQL

Cost Conscious Data Warehousing with Cloudera Data Platform

Cloudera

DECEMBER 10, 2020

Why worry about costs with cloud-native data warehousing? Have you been burned by the unexpected costs of a cloud data warehouse? If not, before adopting a cloud data warehouse, consider the true costs of a cloud-native data warehouse. These costs impede the adoption of cloud-native data warehouses.

Data Warehouse

Data Warehouse Cloud Storage Metadata Data

Change Data Capture (CDC): What it is and How it Works

Striim

MARCH 21, 2025

Since the value of data quickly drops over time, organizations need a way to analyze data as it is generated. To avoid disruptions to operational databases, companies typically replicate data to data warehouses for analysis.

IT

IT Data Lake Data Warehouse Relational Database

Ready-to-go sample data pipelines with Dataflow

Netflix Tech

DECEMBER 3, 2022

The most commonly used one is dataflow project, which helps folks manage their data pipeline repositories through creation, testing, deployment and a few other activities. It lets you create YAML-formatted mock data files based on selected tables, columns and a few rows of data from the Netflix data warehouse.

Data Pipeline

Data Pipeline Scala Metadata Food

The Downfall of the Data Engineer

Maxime Beauchemin

AUGUST 28, 2017

Consensus seeking Whether you think that old-school data warehousing concepts are fading or not, the quest to achieve conformed dimensions and conformed metrics is as relevant as it ever was. The data warehouse needs to reflect the business, and the business should have clarity on how it thinks about analytics.

Data Engineering

Data Engineering Data Engineer Engineering Software Engineer

Bring Geospatial Analytics Across Disparate Datasets Into Your Toolkit With The Unfolded Platform

Data Engineering Podcast

JUNE 26, 2022

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code.

Datasets

Datasets Unstructured Data Metadata MongoDB

Building A System Of Record For Your Organization's Data Ecosystem At Metaphor

Data Engineering Podcast

DECEMBER 19, 2021

Summary Building a well managed data ecosystem for your organization requires a holistic view of all of the producers, consumers, and processors of information. The team at Metaphor are building a fully connected metadata layer to provide both technical and social intelligence about your data. No more scripts, just SQL.

Systems

Systems Building Metadata Data Warehouse

Functional Data Engineering — a modern paradigm for batch data processing

Maxime Beauchemin

JANUARY 7, 2018

Note that where a TRUNCATE PARTITION is typically a “free” metadata operation, a DELETE operation may be expensive, and that should be taken into consideration. This means that ideally the logic in source control describes how to build the full state of the data warehouse throughout all time periods.

Data Engineering

Data Engineering Data Engineer Data Process Process

Bring Order To The Chaos Of Your Unstructured Data Assets With Unstruk

Data Engineering Podcast

JUNE 17, 2021

In this episode he shares the goals of the Unstruk Data Warehouse, how it is architected to extract asset metadata and build a searchable knowledge graph from the information, and the myriad ways that the system can be used. Hightouch is the easiest way to sync data into the platforms that your business teams rely on.

Unstructured Data

Unstructured Data Data Warehouse Metadata Media

Making The Total Cost Of Ownership For External Data Manageable With Crux

Data Engineering Podcast

JULY 17, 2022

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code.

Data Management

Data Management Management Metadata MongoDB

A Look At The Data Systems Behind The Gameplay For League Of Legends

Data Engineering Podcast

NOVEMBER 20, 2022

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it.

Systems

Systems Metadata Data Pipeline MongoDB

Stop Overcomplicating Data Quality

Towards Data Science

DECEMBER 10, 2024

Take advantage of old-school database tricks. In the last 10-15 years we've seen massive changes to the data industry, notably big data, parallel processing, cloud computing, data warehouses, and new tools (lots and lots of new tools). Consequently, we've had to say goodbye to some things to make room for all this new stuff.

PostgreSQL

PostgreSQL Data Python SQL
