dbt Core is an open-source framework that helps you organise data warehouse SQL transformations. dbt was born out of the observation that more and more companies were switching from on-premise Hadoop data infrastructure to cloud data warehouses, a switch led by the modern data stack vision.
Meta joins the Data Transfer Project and has continuously led the development of shared technologies that enable users to port their data from one platform to another. 2024: Users can access data logs in Download Your Information. What are data logs?
Snowflake was founded in 2012 around its data warehouse product, which is still its core offering. Databricks was founded in 2013 by researchers who co-created Spark, which became a top-level Apache project in 2014. The table format adds metadata, reads, writes, and transactions that let you treat a Parquet file as a table.
Consensus seeking: Whether you think old-school data warehousing concepts are fading or not, the quest to achieve conformed dimensions and conformed metrics is as relevant as it ever was. The data warehouse needs to reflect the business, and the business should have clarity on how it thinks about analytics.
Setting the Stage: We need E&L practices because “copying raw data” is more complex than it sounds. For instance, how would you know which orders got “canceled,” when a cancellation usually just modifies the existing record in place?
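One common way to surface such in-place changes during extract & load is to diff successive snapshots of the source table. A minimal sketch, assuming hypothetical daily snapshots of an orders table:

```python
import pandas as pd

# Hypothetical daily snapshots of the same operational "orders" table.
yesterday = pd.DataFrame(
    {"order_id": [1, 2, 3], "status": ["placed", "placed", "shipped"]}
)
today = pd.DataFrame(
    {"order_id": [1, 2, 3], "status": ["placed", "canceled", "shipped"]}
)

# Join the snapshots on the key and keep rows whose status changed; this
# recovers updates (like cancellations) that happened in place at the source.
diff = yesterday.merge(today, on="order_id", suffixes=("_old", "_new"))
changed = diff[diff["status_old"] != diff["status_new"]]
print(changed)  # order_id 2: placed -> canceled
```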
While business rules evolve constantly, and while corrections and adjustments to the process are more the rule than the exception, it’s important to insulate compute logic changes from data changes and have control over all of the moving parts. But how do we model this in a functional data warehouse without mutating data?
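One common answer, sketched here under assumed table and partition names, is to never update rows in place: each run writes a new, date-keyed partition, so logic changes produce new partitions while old ones stay reproducible.

```python
from pathlib import Path

import pandas as pd

def write_snapshot(df: pd.DataFrame, table: str, ds: str, root: str = "warehouse") -> None:
    """Write a date-partitioned snapshot instead of mutating existing rows.

    Re-running a corrected transformation for the same `ds` replaces only that
    partition; every other partition, and therefore history, stays intact.
    """
    path = Path(root) / table / f"ds={ds}"
    path.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path / "part-000.parquet", index=False)

orders = pd.DataFrame({"order_id": [1, 2], "status": ["placed", "canceled"]})
write_snapshot(orders, table="orders_cleaned", ds="2024-01-01")
```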
Data warehouses are the centralized repositories that store and manage data from various sources. They are integral to an organization’s data strategy, ensuring data accessibility, accuracy, and utility. However, beneath their surface lies a host of invisible risks embedded within the data warehouse layers.
The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. Properly integrated with the rest of your tools, the metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets.
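At its core, such a repository can be as simple as one row per dataset recording lineage and freshness; the schema below is a toy illustration, not any particular catalog's layout.

```python
import datetime
import sqlite3

# A toy metadata repository: one row per dataset, recording upstream lineage
# and freshness so consumers can discover datasets and judge their health.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE dataset_catalog (
           name TEXT PRIMARY KEY,
           upstream TEXT,       -- comma-separated lineage
           row_count INTEGER,
           last_updated TEXT)"""
)
conn.execute(
    "INSERT INTO dataset_catalog VALUES (?, ?, ?, ?)",
    ("orders_cleaned", "raw_orders", 10432, datetime.date.today().isoformat()),
)

# Health reporting then becomes a query, e.g. flag datasets gone stale.
stale = conn.execute(
    "SELECT name FROM dataset_catalog WHERE last_updated < date('now', '-2 days')"
).fetchall()
print(stale)
```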
This article looks at the options available for storing and processing big data, which is too large for conventional databases to handle. There are two main options available: a data lake and a data warehouse. What is a Data Warehouse? What is a Data Lake?
Data volume and velocity, governance, structure, and regulatory requirements have all evolved and continue to do so. Despite their limitations, data warehouses, introduced in the late 1980s and based on ideas developed even earlier, remain in widespread use today for certain business intelligence and data analysis applications.
As you do not want to start your development with uncertainty, you decide to go for the operational raw data directly. Accessing Operational Data: I used to connect to views in transactional databases, or to APIs offered by operational systems, to request the raw data. Does that sound familiar?
Vendors of data warehouses, data lakes, and now data lakehouses each offer their own distinct advantages and disadvantages for data teams to consider. So let’s get to the bottom of the big question: what kind of data storage layer will provide the strongest foundation for your data platform?
At TCS, we help companies shift their enterprise data warehouse (EDW) platforms to the cloud alongside our broader IT services. We’re extremely familiar with just how tricky a cloud migration can be, especially when it involves moving historical business data. How many tables and views will be migrated, and how much raw data?
While cloud-native, point-solution data warehouse services may serve your immediate business needs, there are dangers to the corporation as a whole when you do your own IT this way. Cloudera Data Warehouse (CDW) is here to save the day! CDW is an integrated data warehouse service within Cloudera Data Platform (CDP).
The greatest data processing challenge of 2024 is the lack of qualified data scientists with the skill set and expertise to handle this gigantic volume of data. Inability to process large volumes of data: an estimated 2.5 quintillion bytes of data are produced every day, and 60 percent of workers spend days trying to make sense of it.
“Data Lake vs. Data Warehouse = Load First, Think Later vs. Think First, Load Later.” The terms data lake and data warehouse come up frequently when it comes to storing large volumes of data. Data Warehouse Architecture. What is a Data Lake?
But this data is not that easy to manage, since a lot of the data we produce today is unstructured. In fact, 95% of organizations acknowledge the need to manage unstructured raw data, which is challenging and expensive to manage and analyze, making it a major concern for most businesses. How Does AWS Glue Work?
Data Store: Another significant change from 2021 to 2024 lies in the shift from “Data Warehouse” to “Data Store,” acknowledging the expanding database horizon, including the rise of data lakes. Their robust core offering seamlessly integrates data warehouses with data-hungry applications.
Collecting, cleaning, and organizing data into a coherent form for business users to consume are all standard data modeling and data engineering tasks for loading a data warehouse. Based on the Tecton blog: so is this similar to data engineering pipelines into a data lake/warehouse?
Over the past several years, data warehouses have evolved dramatically, but that doesn’t mean the fundamentals underpinning sound data architecture need to be thrown out the window. What is a Data Vault model? The data vault paradigm addresses the desire to overlay organization on top of semi-permanent raw data storage.
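In rough terms, Data Vault splits that organization into hubs (business keys), links (relationships), and satellites (descriptive history). A minimal sketch with illustrative table names, using SQLite as a stand-in:

```python
import sqlite3

# Toy Data Vault layout: hubs hold business keys, links hold relationships,
# satellites hold attributes with load timestamps. Everything is append-only,
# which is how the paradigm overlays structure on semi-permanent raw storage.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_id TEXT, load_ts TEXT);
CREATE TABLE hub_order    (order_hk TEXT PRIMARY KEY, order_id TEXT, load_ts TEXT);
CREATE TABLE link_customer_order (customer_hk TEXT, order_hk TEXT, load_ts TEXT);
CREATE TABLE sat_customer (customer_hk TEXT, name TEXT, segment TEXT, load_ts TEXT);
""")
conn.execute("INSERT INTO hub_customer VALUES ('hk1', 'C-1001', '2024-01-01')")
```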
The pun being obvious, there’s more to it than just a new term: data lakehouses combine the best features of both data lakes and data warehouses, and this post will explain it all. What is a data lakehouse? Data warehouse vs. data lake vs. data lakehouse: what’s the difference?
Databricks announced that Delta table metadata will also be compatible with the Iceberg format, and Snowflake has likewise been moving aggressively to integrate with Iceberg. Iceberg is designed to be easily queryable with SQL, even for large analytic tables (we’re talking petabytes of data). How Apache Iceberg tables structure metadata.
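For a feel of that metadata structure, here is a hedged sketch using the pyiceberg client; the catalog name and table identifier are hypothetical and assume a catalog configured locally.

```python
from pyiceberg.catalog import load_catalog

# Hypothetical names: a catalog called "default" and a table "db.events".
catalog = load_catalog("default")
table = catalog.load_table("db.events")

# Iceberg tracks table state as a metadata tree:
#   table metadata file -> snapshots -> manifest lists -> manifests -> data files
# which lets engines plan petabyte-scale SQL scans from metadata alone.
print(table.metadata.current_snapshot_id)
for snap in table.metadata.snapshots:
    print(snap.snapshot_id, snap.timestamp_ms)
```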
The term was coined by James Dixon, Back-End Java, Data, and Business Intelligence Engineer, and it started a new era in how organizations could store, manage, and analyze their data. This article explains what a data lake is, its architecture, and diverse use cases. Data warehouse vs. data lake in a nutshell.
With so much riding on the efficiency of ETL processes for data engineering teams, it is essential to take a deep dive into the complex world of ETL on AWS and bring your data management to the next level. ETL has typically been carried out using data warehouses and on-premise ETL tools.
Secondly, the rise of data lakes catalyzed the transition from ETL to ELT and paved the way for niche paradigms such as Reverse ETL and Zero-ETL. Still, these methods have been overshadowed by EtLT, the predominant approach reshaping today’s data landscape. Read More: What is ETL?
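The EtLT idea, in a minimal sketch with assumed names: a small "t" (deduplication, typing, nothing business-specific) runs before the load, while the heavy business-rule "T" runs afterwards inside the warehouse.

```python
import pandas as pd

def extract() -> pd.DataFrame:
    # Hypothetical source extract with duplicates and string-typed amounts.
    return pd.DataFrame({"id": [1, 1, 2], "amount": ["10", "10", "15"]})

def light_transform(df: pd.DataFrame) -> pd.DataFrame:
    # The small "t": dedupe and fix types only; no business logic yet.
    return df.drop_duplicates("id").astype({"amount": float})

def load(df: pd.DataFrame) -> None:
    # Stand-in for loading a warehouse landing table.
    df.to_parquet("orders_landing.parquet", index=False)

# The big "T" (business rules) then runs inside the warehouse, e.g. as SQL.
load(light_transform(extract()))
```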
Traditionally, after being stored in a data lake, raw data was then often moved to various destinations like a data warehouse for further processing, analysis, and consumption. Databricks Data Catalog and AWS Lake Formation are examples in this vein. See our post: Data Lakes vs. Data Warehouses.
The data products are packaged around the business needs and in support of the business use cases. This step requires curation, harmonization, and standardization from the raw data into the products. Luke: Let’s talk about some of the fundamentals of modern data architecture. What is a data fabric?
Two different data modeling approaches, dimensional data modeling and Data Vault, each have their own pros and cons. Modernizing a data warehouse with Snowflake Data Cloud is a smart investment that can provide significant benefits to businesses of all sizes, today more than ever as data models become ever more complex.
This week, we got to think about our data ingestion design. We looked at the following: how do we ingest (ETL vs. ELT), where do we store the data (data lake vs. data warehouse), and which tool do we use to ingest (cron job vs. workflow engine). NOTE: This week's task requires good internet speed and good compute.
What is data curation? Data curation is the process of transforming and enriching large amounts of raw data into smaller, more widely accessible subsets of data that provide additional value to the organization or the intended use case. It would also make the data engineering team a bottleneck.
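As a toy illustration of that reduction (the column and metric names are made up): take a wide raw event table and publish only the small, purpose-built subset most consumers actually need.

```python
import pandas as pd

# Wide, raw event data, including noise most consumers never need.
raw = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event": ["view", "buy", "view"],
    "price": [0.0, 9.99, 0.0],
    "debug_payload": ["...", "...", "..."],
})

# Curation: filter, aggregate, and publish a smaller, self-describing subset.
curated = (
    raw[raw["event"] == "buy"]
    .groupby("user_id", as_index=False)
    .agg(purchases=("event", "count"), revenue=("price", "sum"))
)
print(curated)
```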
Over the past several years, cloud data lakes like Databricks have gotten so powerful (and popular) that, according to Mordor Intelligence, the data lake market is expected to grow from $3.74 Traditionally, data lakes held raw data in its native format and were known for their flexibility, speed, and open source ecosystem.
Secondly, define business rules: develop the transformations on raw data, include the business logic, and develop the relationships among different source tables to produce meaningful data. Thirdly, data consumption: develop the views on transformed or aggregated tables. Snowpipe automates the ingestion process.
Those tasks save the extracted data in the data lake (or warehouse, or lakehouse). Other tasks, probably SQL jobs orchestrated with dbt, run transformations on the loaded data. They query raw data tables, enrich them, join between tables, and create business data, all ready to be used.
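That enrichment step is typically just a SQL model; the self-contained sketch below uses SQLite as a stand-in warehouse, with hypothetical table and column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.executescript("""
CREATE TABLE raw_orders (order_id INT, customer_id INT, amount REAL);
CREATE TABLE raw_customers (customer_id INT, segment TEXT);
INSERT INTO raw_orders VALUES (1, 10, 99.0), (2, 11, 25.0);
INSERT INTO raw_customers VALUES (10, 'enterprise'), (11, 'self-serve');

-- The kind of transformation a dbt job would orchestrate: join raw
-- tables and materialize business-ready data.
CREATE VIEW orders_enriched AS
SELECT o.order_id, o.amount, c.segment
FROM raw_orders o JOIN raw_customers c USING (customer_id);
""")
print(conn.execute("SELECT * FROM orders_enriched").fetchall())
```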
Selecting the strategies and tools for validating data transformations and data conversions in your data pipelines. Introduction: Data transformations and data conversions are crucial to ensuring that raw data is organized, processed, and ready for useful analysis.
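Common validation strategies include reconciling row counts, checking keys for nulls, and comparing aggregate checksums between source and target. A small sketch (the `order_id` and `amount` columns are hypothetical):

```python
import pandas as pd

def validate_transformation(source: pd.DataFrame, target: pd.DataFrame) -> list:
    """Reconcile row counts, check required keys, compare an aggregate checksum."""
    failures = []
    if len(source) != len(target):
        failures.append(f"row count mismatch: {len(source)} vs {len(target)}")
    if target["order_id"].isnull().any():
        failures.append("null keys in target")
    if abs(source["amount"].sum() - target["amount"].sum()) > 1e-6:
        failures.append("amount checksum drifted during transformation")
    return failures

src = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 15.5]})
assert validate_transformation(src, src.copy()) == []
```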
Data in Place refers to the organized structuring and storage of data within a specific storage medium, be it a database, bucket store, files, or other storage platforms. In the contemporary data landscape, data teams commonly utilize data warehouses or lakes to arrange their data into L1, L2, and L3 layers.
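Layer semantics vary by team; assuming the common convention of L1 as raw-as-landed, L2 as cleaned and typed, and L3 as business-ready aggregates, a minimal file-layout sketch:

```python
from pathlib import Path

import pandas as pd

root = Path("lake")
raw = pd.DataFrame({"order_id": ["1", "2"], "amount": ["10.0", "bad"]})

(root / "L1").mkdir(parents=True, exist_ok=True)
raw.to_csv(root / "L1" / "orders.csv", index=False)       # L1: raw, untouched

clean = raw.assign(amount=pd.to_numeric(raw["amount"], errors="coerce")).dropna()
(root / "L2").mkdir(exist_ok=True)
clean.to_csv(root / "L2" / "orders.csv", index=False)     # L2: cleaned and typed

summary = pd.DataFrame({"total_amount": [clean["amount"].sum()]})
(root / "L3").mkdir(exist_ok=True)
summary.to_csv(root / "L3" / "orders_summary.csv", index=False)  # L3: consumable
```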
is whether to choose a data warehouse or lake to power storage and compute for their analytics. While data warehouses provide structure that makes it easy for data teams to efficiently operationalize data (i.e., Data discovery tools and platforms can help.
SiliconANGLE theCUBE: Analyst Predictions 2023 - The Future of Data Management. By far one of the best analyses of trends in data management. The panel's 2023 predictions include: unified metadata becomes kingmaker. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost-effective solution.
One of the innovative ways to address this problem is to build a data hub: a platform that unites all your information sources under a single umbrella. This article explains the main concepts of a data hub, its architecture, and how it differs from data warehouses and data lakes. What is a Data Hub?
Now that we have understood how significant a role data plays, a set of further questions opens up: How do we acquire or extract raw data from the source? How do we transform this data to get valuable insights from it? Where do we finally store or load the transformed data?
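Those three questions map onto the classic extract-transform-load steps; here is a tiny, self-contained pipeline annotated with them (all names hypothetical, with SQLite standing in for the destination):

```python
import sqlite3

import pandas as pd

# 1. Extract: acquire raw data from the source (an in-memory stand-in here).
raw = pd.DataFrame({"ts": ["2024-01-01", "2024-01-01"], "clicks": [3, 4]})

# 2. Transform: derive valuable insight from the raw records.
daily = raw.groupby("ts", as_index=False)["clicks"].sum()

# 3. Load: store the result where analysts can reach it.
conn = sqlite3.connect(":memory:")
daily.to_sql("daily_clicks", conn, index=False)
print(conn.execute("SELECT * FROM daily_clicks").fetchall())
```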
As the demand for big data grows, an increasing number of businesses are turning to cloud data warehouses. The cloud is the only platform to handle today's colossal data volumes because of its flexibility and scalability. Launched in 2014, Snowflake is one of the most popular cloud data solutions on the market.
It is a data integration process in which you first extract raw information (in its original formats) from various sources and load it straight into a central repository, such as a cloud data warehouse, a data lake, or a data lakehouse, where you transform it into suitable formats for further analysis and reporting.
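Note how this reverses the order of the ETL sketch above: the load happens before any transformation. A minimal ELT sketch with hypothetical names, again using SQLite as the stand-in repository:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse or lakehouse

# Extract + Load: land the raw records in their original shape, untransformed.
raw = pd.DataFrame({"order_id": [1, 2], "status": [" Placed ", "CANCELED"]})
raw.to_sql("raw_orders", conn, index=False)

# Transform: reshape *inside* the repository, using its own SQL engine.
conn.execute(
    "CREATE VIEW orders AS "
    "SELECT order_id, LOWER(TRIM(status)) AS status FROM raw_orders"
)
print(conn.execute("SELECT * FROM orders").fetchall())
```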
dbt, which stands for data build tool, is a powerful tool designed to transform and manage data in a scalable and reproducible manner. It allows data engineers to define and execute data transformations in a structured and modular way. Initialize the Airflow metadata database by running airflow db init (airflow initdb on Airflow 1.x) in your terminal.
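Wiring the two together usually means an Airflow DAG that shells out to dbt; a minimal sketch for Airflow 2.x, where the project path and schedule are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_transformations",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the dbt models, then their tests; the project path is made up.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/my_project && dbt run",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt/my_project && dbt test",
    )
    dbt_run >> dbt_test
```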
Data collection revolves around gathering raw data from various sources, with the objective of using it for analysis and decision-making. It includes manual data entry, online surveys, extracting information from documents and databases, capturing signals from sensors, and more.
Other models were also evaluated, including BERT, a model trained on a big corpus of text data, but the team found that BERT was better suited for word embeddings than sentence embeddings and was pre-trained only in English. Due to the large number of listings at eBay, the data is loaded in batches to HDFS, eBay’s data warehouse.
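The word-versus-sentence distinction is what sentence-embedding models address: they emit one fixed-size vector per sentence rather than per token. An illustration using the sentence-transformers library and a public multilingual checkpoint (not eBay's actual model):

```python
from sentence_transformers import SentenceTransformer, util

# A small public multilingual checkpoint, chosen only for illustration.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

titles = ["vintage leather camera bag", "sac photo en cuir vintage"]
emb = model.encode(titles)  # one fixed-size vector per sentence, not per word

# Cross-language similarity between two listing titles.
print(util.cos_sim(emb[0], emb[1]))
```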