Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. Though basic and easy to use, traditional table storage formats struggle to keep up. Modern table formats instead track data files within the table along with their column statistics.
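The file-pruning idea behind those column statistics can be sketched in a few lines. This is a toy illustration, not any real table format's metadata layout; the file names and stats structure are made up:

```python
# Toy sketch: a table format keeps per-file min/max column statistics,
# so a query engine can skip files whose range cannot match the filter.
# Structure and names are illustrative, not any real format's metadata.

data_files = [
    {"path": "part-0.parquet", "stats": {"order_date": {"min": "2023-01-01", "max": "2023-03-31"}}},
    {"path": "part-1.parquet", "stats": {"order_date": {"min": "2023-04-01", "max": "2023-06-30"}}},
    {"path": "part-2.parquet", "stats": {"order_date": {"min": "2023-07-01", "max": "2023-09-30"}}},
]

def prune(files, column, lo, hi):
    """Return only the files whose [min, max] range overlaps the query range."""
    return [
        f["path"]
        for f in files
        if f["stats"][column]["max"] >= lo and f["stats"][column]["min"] <= hi
    ]

# A query filtering on Q2 dates only needs to read one of the three files.
print(prune(data_files, "order_date", "2023-04-01", "2023-06-30"))
# → ['part-1.parquet']
```

ISO-formatted date strings compare correctly lexicographically, which is why plain string comparison suffices in this sketch.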
With the amount of data companies are using growing to unprecedented levels, organizations are grappling with the challenge of efficiently managing and deriving insights from these vast volumes of structured and unstructured data. What is a Data Lake? Consistency of data throughout the data lake.
A brief history of data storage. The value of data has been apparent for as long as people have been writing things down. While data warehouses are still in use, they are limited in their use cases as they only support structured data. A few big tech companies have the in-house expertise to customize their own data lakes.
This article looks at the options available for storing and processing big data, which is too large for conventional databases to handle. There are two main options available: a data lake and a data warehouse. What is a Data Warehouse? What is a Data Lake?
Data lakes are useful, flexible data storage repositories that enable many types of data to be stored in their rawest state. Traditionally, after being stored in a data lake, raw data was often moved to various destinations like a data warehouse for further processing, analysis, and consumption.
In 2010, a transformative concept took root in the realm of data storage and analytics: the data lake. The term was coined by James Dixon, Back-End Java, Data, and Business Intelligence Engineer, and it started a new era in how organizations could store, manage, and analyze their data. What is a data lake?
That’s why it’s essential for teams to choose the right architecture for the storage layer of their data stack. But the options for data storage are evolving quickly. So let’s get to the bottom of the big question: what kind of data storage layer will provide the strongest foundation for your data platform?
These formats are changing the way data is stored and metadata accessed. Apache Iceberg is a high-performance open table format developed for modern data lakes. Iceberg Data Catalog - an open-source metadata management system that tracks the schema, partition, and versions of Iceberg tables.
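The version tracking mentioned above is what enables time travel: each commit produces a new immutable snapshot, and older snapshots stay readable. The following is a minimal toy model of that idea, not the real Iceberg metadata format or API:

```python
# Toy model of Iceberg-style snapshot tracking: every commit appends an
# immutable snapshot of the table's file list, so readers can query the
# table "as of" any earlier version. Not the actual Iceberg format.

class TableMetadata:
    def __init__(self, schema):
        self.schema = schema   # column name -> type
        self.snapshots = []    # ordered list of file manifests

    def commit(self, new_files):
        """Append a new snapshot; earlier snapshots remain readable."""
        current = list(self.snapshots[-1]) if self.snapshots else []
        self.snapshots.append(current + new_files)
        return len(self.snapshots) - 1  # snapshot id

    def files_as_of(self, snapshot_id):
        """Time travel: read the file list as it was at a given snapshot."""
        return self.snapshots[snapshot_id]

table = TableMetadata({"id": "long", "event": "string"})
v0 = table.commit(["data-0.parquet"])
v1 = table.commit(["data-1.parquet"])
print(table.files_as_of(v0))  # ['data-0.parquet']
print(table.files_as_of(v1))  # ['data-0.parquet', 'data-1.parquet']
```

The key property being illustrated: a commit never rewrites history, it only adds a new snapshot, which is what makes concurrent readers and rollback cheap.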
RudderStack helps you build a customer data platform on your warehouse or data lake. Batch or streaming (acceptable latencies)? Data storage (lake or warehouse)? How is the data going to be used?
“Data Lake vs Data Warehouse = Load First, Think Later vs Think First, Load Later.” The terms data lake and data warehouse come up frequently when it comes to storing large volumes of data. Data Warehouse Architecture. What is a Data Lake?
With CDW, as an integrated service of CDP, your line of business gets immediate resources needed for faster application launches and expedited data access, all while protecting the company’s multi-year investment in centralized data management, security, and governance. Proprietary file formats mean no one else is invited in!
File formats: this is a huge part of data engineering, picking the right format for your data storage. You'll be seen as the most technical person on a data team, and you'll need to help your team with "low-level" stuff. You'll also be asked to put a data infrastructure in place.
Concepts, theory, and functionalities of this modern data storage framework. Photo by Nick Fewings on Unsplash. Introduction: I think it’s now perfectly clear to everybody the value data can have. To use a hyped example, models like ChatGPT could only be built on a huge mountain of data, produced and collected over years.
It offers a simple and efficient solution for data processing in organizations: a data integration tool that organizes data from many sources, formats it, and stores it in a single repository, such as a data lake or data warehouse, where it can be used to facilitate business decisions.
Open source data lakehouse deployments are built on the foundations of compute engines (like Apache Spark, Trino, Apache Flink), distributed storage (HDFS, cloud blob stores), and metadata catalogs / table formats (like Apache Iceberg, Delta, Hudi, Apache Hive Metastore). Tables are governed as per agreed-upon company standards.
At its core, a table format is a sophisticated metadata layer that defines, organizes, and interprets multiple underlying data files. Table formats incorporate aspects like columns, rows, data types, and relationships, but can also include information about the structure of the data itself.
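That "metadata layer over multiple files" idea can be made concrete with a small sketch. Everything here (the table name, partition spec, per-file row counts) is hypothetical, chosen only to illustrate the shape of the metadata:

```python
# Minimal illustration of a table format as a metadata layer: one logical
# table is a schema plus a partition spec plus a list of underlying data
# files. Names and fields are hypothetical, not Iceberg/Delta/Hudi's.

table = {
    "name": "orders",
    "schema": [("order_id", "long"), ("amount", "double"), ("country", "string")],
    "partition_by": ["country"],
    "files": [
        {"path": "country=US/part-0.parquet", "rows": 1200},
        {"path": "country=DE/part-0.parquet", "rows": 450},
    ],
}

def row_count(t):
    """Some queries (e.g. COUNT(*)) can be answered from metadata alone,
    without opening a single data file."""
    return sum(f["rows"] for f in t["files"])

print(row_count(table))  # → 1650
```

Answering queries from metadata alone is one of the practical payoffs of putting this layer between the engine and the raw files.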
Data lakehouse architecture combines the benefits of data warehouses and data lakes, bringing together the structure and performance of a data warehouse with the flexibility of a data lake. Storage layer 3. Metadata layer 4. …ok, so maybe they don’t say that. But they should!
To help organizations realize the full potential of their data lake and lakehouse investments, Monte Carlo, the data observability leader, is proud to announce integrations with Delta Lake and Databricks’ Unity Catalog for full data observability coverage.
The pun being obvious, there’s more to it than just a new term: data lakehouses combine the best features of both data lakes and data warehouses, and this post will explain it all. What is a data lakehouse? Data warehouse vs data lake vs data lakehouse: what’s the difference?
Today’s cloud systems excel at high-volume data storage, powerful analytics, AI, and software and systems development. Cloud-based DevOps provides a modern, agile environment for developing and maintaining applications and services that interact with the organization’s mainframe data.
Dive into Spyne's experience with: - Their search for query acceleration with pre-aggregations and caching - Developing new functionality with OpenAI - Optimizing query cost with their data warehouse [link] Suresh Hasuni: Cost Optimization Strategies for Scalable Data Lakehouse. Cost is the major concern as the adoption of data lakes increases.
A data engineer’s integral task is building and maintaining data infrastructure: the system managing the flow of data from its source to its destination. This typically includes setting up two processes: an ETL pipeline, which moves data, and a data storage layer (typically a data warehouse), where it’s kept.
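The extract-transform-load shape described above can be sketched end to end in a few functions. The "source" and "warehouse" here are plain in-memory stand-ins, not real systems:

```python
# Minimal ETL sketch: extract from a source, transform in flight, load
# into a storage target. Source rows and the warehouse are in-memory
# stand-ins for an operational database and a real warehouse.

def extract():
    # Pretend these rows came from an operational database or an API.
    return [{"name": " Ada ", "signups": "3"}, {"name": "Grace", "signups": "5"}]

def transform(rows):
    # Clean and normalize before loading (the "T" happens before the "L").
    return [{"name": r["name"].strip(), "signups": int(r["signups"])} for r in rows]

def load(rows, warehouse):
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
# → [{'name': 'Ada', 'signups': 3}, {'name': 'Grace', 'signups': 5}]
```

In an ELT variant the `transform` step would instead run inside the warehouse after loading; the function boundaries stay the same, only their order changes.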
One of the innovative ways to address this problem is to build a data hub: a platform that unites all your information sources under a single umbrella. This article explains the main concepts of a data hub, its architecture, and how it differs from data warehouses and data lakes. What is a Data Hub?
The landing page lists all the resource recommendations along with metadata around resource owners (Azure security groups), recommendation message, current lifecycle status of the recommendation, due date, assigned engineer, last action message in terms of comments, and a history modal option to check the timeline of actions taken.
It’s designed to improve upon the performance and usability challenges of older data storage formats such as Apache Hive and Apache Parquet. For example, Monte Carlo can monitor Apache Iceberg tables for data quality incidents, where other data observability platforms may be more limited.
There are also several changes in KRaft (namely Revise KRaft Metadata Records and Producer ID generation in KRaft mode), along with many other changes. Unfortunately, the feature that was most awaited (at least by me) – tiered storage – has been postponed to a subsequent release. Support for Scala 2.12. And more files means more time.
a runtime environment (sandbox) for classic business intelligence (BI), advanced analysis of large volumes of data, predictive maintenance, and data discovery and exploration; a store for raw data; a tool for large-scale data integration; and a suitable technology to implement data lake architecture.
DataOps Architecture. Legacy data architectures, which have been widely used for decades, are often characterized by their rigidity and complexity. These systems typically consist of siloed data storage and processing environments, with manual processes and limited collaboration between teams.
The data engineering world is full of tips and tricks on how to handle specific patterns that recur with every data pipeline. Already in 2016, IBM estimated the cost of bad data to be over three trillion dollars, and that was before the chaos of data lakes emerged and orphaned datasets began to swamp the land.
Data Architecture Data architecture is a composition of models, rules, and standards for all data systems and interactions between them. Data Catalog An organized inventory of data assets relying on metadata to help with data management.
When it comes to the question of building or buying your data stack, there’s never a one-size-fits-all solution for every data team—or every component of your data stack. Data storage and compute are very much the foundation of your data platform. Let’s jump in!
It was built from the ground up for interactive analytics and can scale to the size of Facebook while approaching the speed of commercial data warehouses. Presto allows you to query data stored in Hive, Cassandra, relational databases, and even bespoke data storage systems. To contribute to this project, hop onto: [link]
In this post, we will help you quickly level up your overall knowledge of data pipeline architecture by reviewing: Table of Contents What is data pipeline architecture? Why is data pipeline architecture important? These pipelines differ from traditional ELT pipelines by doing the data cleaning and normalization prior to load.
It also offers a unique architecture that allows users to quickly build tables and begin querying data without administrative or DBA involvement. Snowflake is a cloud-based data platform that provides excellent manageability regarding data warehousing, data lakes, data analytics, etc. What Does Snowflake Do?
In 2017, big data platforms built only for Hadoop will fail to continue, and the ones that are data- and source-agnostic will survive. Organizations are embarking on a data lake strategy for applications that are centralized and for applications coming together on a single central platform.
Forrester describes Big Data Fabric as “a unified, trusted, and comprehensive view of business data produced by orchestrating data sources automatically, intelligently, and securely, then preparing and processing them in big data platforms such as Hadoop and Apache Spark, data lakes, in-memory, and NoSQL.”
Based on the Tecton blog. So is this similar to data engineering pipelines into a data lake/warehouse? Snowflake can also ingest external tables from on-premises data sources via S3-compliant data storage APIs. Yes, feature stores are part of the MLOps discipline.
Data lineage provides the answer by telling you which upstream sources and downstream ingestors were impacted, as well as which teams are generating the data and who is accessing it. Throughout this time, data is transformed, often more than once.
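Lineage is naturally a directed graph, and the "which downstream ingestors were impacted" question is a graph traversal. A toy sketch under made-up asset names:

```python
# Toy sketch of data lineage as a directed graph: given an upstream asset,
# walk the edges to find every downstream table affected by an incident.
# Asset names are invented for illustration.

from collections import deque

lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.revenue", "mart.retention"],
    "mart.revenue": ["dashboard.finance"],
}

def downstream(asset, graph):
    """Breadth-first walk over lineage edges from one asset."""
    seen, queue = set(), deque(graph.get(asset, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph.get(node, []))
    return sorted(seen)

print(downstream("raw.orders", lineage))
# → ['dashboard.finance', 'mart.retention', 'mart.revenue', 'staging.orders']
```

Real lineage tools extract these edges automatically from query logs and orchestrator metadata rather than declaring them by hand, but the impact-analysis query is the same traversal.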
The cloud could also be full of semi-structured or unstructured data, with more than 225 NoSQL data stores, which makes it one of the most important skills to be thorough with. Metadata adds business context to your data and helps transform it into understandable knowledge.
Snowpark allowed us to be able to reference the unique URL of each PDF that was added to the stream connected to our data lake, which unlocked the ability to process that PDF natively within Snowflake. Using a Python UDF, we were able to accomplish this task.
There are three steps involved in the deployment of a big data model. Data Ingestion: the first step, i.e., extracting data from multiple data sources. Data Variety: Hadoop stores structured, semi-structured, and unstructured data.