Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.
As an important part of achieving better scalability, Ozone separates metadata management across different services: the Ozone Manager (OM) service manages namespace metadata such as volumes, buckets, and keys, while the Datanode service manages the metadata of blocks, containers, and pipelines running on the datanode.
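To make that split concrete, here is a purely illustrative Python sketch (not Ozone's actual code; all names are hypothetical): the OM side holds only the namespace map from keys to block IDs, while the datanode side holds only block-level metadata.

```python
# Illustrative sketch of Ozone's metadata separation (hypothetical data).

# OM-side: namespace metadata only -- resolves /volume/bucket/key to block IDs.
om_namespace = {
    ("vol1", "bucket1", "logs/2024-01-01.log"): ["block-17", "block-18"],
}

# Datanode-side: block metadata only -- which container/pipeline holds each block.
datanode_blocks = {
    "block-17": {"container": "container-3", "pipeline": "pipeline-A"},
    "block-18": {"container": "container-3", "pipeline": "pipeline-A"},
}

def read_key(volume: str, bucket: str, key: str) -> list[dict]:
    """Resolve a key via the OM map, then look up each block on the datanode side."""
    blocks = om_namespace[(volume, bucket, key)]
    return [datanode_blocks[b] for b in blocks]

print(read_key("vol1", "bucket1", "logs/2024-01-01.log"))
```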
Data powers Uber’s global marketplace, enabling more reliable and seamless user experiences across our products for riders, … The post Databook: Turning Big Data into Knowledge with Metadata at Uber appeared first on Uber Engineering Blog.
This switch has been led by the modern data stack vision. In terms of paradigms, before 2012 we were doing ETL because storage was expensive, so it was a requirement to transform data before storing it (mainly in a data warehouse) in order to have the most optimized data for querying.
In the realm of modern analytics platforms, where rapid and efficient processing of large datasets is essential, swift metadata access and management are critical for optimal system performance. Any delays in metadata retrieval can negatively impact user experience, resulting in decreased productivity and satisfaction. What is Atlas?
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. Though basic and easy to use, traditional table storage formats struggle to keep up. Newer table formats, by contrast, track data files within the table along with their column statistics.
Regardless, the important thing to understand is that the modern data stack doesn’t just allow you to store and process bigger data faster; it allows you to handle data fundamentally differently to accomplish new goals and extract different types of value. It’s just a matter of picking a flavor.
Both companies have added data and AI to their slogans; Snowflake used to be "The Data Cloud" and is now "The AI Data Cloud." A table format adds metadata, reads, writes, and transactions that allow you to treat a set of Parquet files as a table. Metadata file: table metadata is stored as JSON.
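As a hedged illustration of that JSON metadata file, the sketch below opens one directly with the standard library; the file path is a placeholder, and the field names ("format-version", "current-snapshot-id", "snapshots") follow the Apache Iceberg table spec.

```python
import json

# Inspect an Iceberg table's metadata JSON directly (path is a placeholder).
with open("warehouse/db/events/metadata/v3.metadata.json") as f:
    meta = json.load(f)

print("format version:", meta["format-version"])
print("current snapshot:", meta["current-snapshot-id"])
# Each snapshot entry records an ID and a commit timestamp, among other fields.
for snap in meta.get("snapshots", []):
    print(snap["snapshot-id"], snap["timestamp-ms"])
```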
For example, the data storage systems and processing pipelines that capture information from genomic sequencing instruments are very different from those that capture the clinical characteristics of a patient from a site. The principles emphasize machine-actionability (i.e.,
The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools.
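A minimal, hypothetical sketch of such a metadata repository, assuming nothing beyond the excerpt: each catalog entry records ownership, lineage edges, and a health flag that reporting tools can query.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DatasetEntry:
    name: str
    owner: str
    upstream: list[str] = field(default_factory=list)  # lineage edges
    last_loaded: datetime | None = None
    healthy: bool = True

catalog: dict[str, DatasetEntry] = {}

def register(entry: DatasetEntry) -> None:
    catalog[entry.name] = entry

def report_unhealthy() -> list[str]:
    """The 'health and status' view a data catalog can expose."""
    return [name for name, e in catalog.items() if not e.healthy]

register(DatasetEntry("raw.orders", owner="ingest-team"))
register(DatasetEntry("mart.revenue", owner="analytics",
                      upstream=["raw.orders"],
                      last_loaded=datetime.now(timezone.utc)))
catalog["raw.orders"].healthy = False
print(report_unhealthy())  # ['raw.orders']
```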
If you haven’t paid attention to the data industry news cycle, you might have missed the recent excitement centered around an open table format called Apache Iceberg™. These formats are changing the way data is stored and metadata is accessed; storage systems should "just work." They are groundbreaking in many ways.
Batch or streaming (what latencies are acceptable)? Data storage (lake or warehouse)? How is the data going to be used? The warehouse (BigQuery, Snowflake, Redshift) has become the focal point of the "modern data stack." Data orchestration: who will be managing the workflow logic?
The Key-Value Service: the KV data abstraction service was introduced to solve the persistent challenges we faced with data access patterns in our distributed databases. At a high level, KV data can be visualized as a set of records, where a large record may be split into a number of chunks.
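A hypothetical sketch of the chunking idea (this is not Netflix's implementation): values are stored as an ordered list of fixed-size chunks keyed by (record key, chunk index), so large records can be read and written piecewise.

```python
CHUNK_SIZE = 4  # tiny for illustration; real systems use KB/MB-sized chunks

def to_chunks(value: bytes) -> list[bytes]:
    return [value[i:i + CHUNK_SIZE] for i in range(0, len(value), CHUNK_SIZE)]

# (record key, chunk index) -> chunk bytes
store: dict[tuple[str, int], bytes] = {}

def put(key: str, value: bytes) -> None:
    for i, chunk in enumerate(to_chunks(value)):
        store[(key, i)] = chunk

def get(key: str) -> bytes:
    chunks, i = [], 0
    while (key, i) in store:   # reassemble chunks in index order
        chunks.append(store[(key, i)])
        i += 1
    return b"".join(chunks)

put("user:42:profile", b"hello key-value world")
assert get("user:42:profile") == b"hello key-value world"
```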
The power of pre-commit and SQLFluff: SQL is a query language used to retrieve information from data stores, and like any other programming language, you need to enforce checks at all times. It covers simple SELECTs as well as advanced concepts.
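SQLFluff exposes a simple Python API alongside its CLI; the snippet below lints a query the way a pre-commit hook would (the query text is just an example, and the violation dict keys such as "code" and "description" reflect SQLFluff's simple API).

```python
import sqlfluff

# Lint a query string against the ANSI dialect; returns a list of violations.
violations = sqlfluff.lint("SELECT id,name FROM users", dialect="ansi")
for v in violations:
    print(v["code"], v["description"])
```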
Under the hood, Rockset utilizes its Converged Index technology, which is optimized for metadata filtering, vector search, and keyword search, supporting sub-second search, aggregations, and joins at scale. Feature generation: transform and aggregate data during the ingest process to generate complex features and reduce data storage volumes.
That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard.
When a third-party user registers on WhatsApp or Messenger, they keep their existing user-visible identifier and are also assigned a unique, WhatsApp-internal identifier that is used at the infrastructure level (for protocols, data storage, etc.).
These scams often target passwords, banking details, or sensitive organizational data by posing as a boss or coworker requesting confidential information. These apps may silently harvest personal data or metadata and, in some cases, install malware onto the device.
Master nodes control and coordinate the two key functions of Hadoop: data storage and parallel processing of data. A master node also keeps track of storage capacity, the volume of data being transferred, and so on. Worker (slave) nodes are the majority of nodes, used to store data and run computations according to instructions from a master node.
Structured data (such as names, dates, IDs, and so on) will be stored in regular SQL databases like Hive or Impala. There are also newer AI/ML applications that need data storage optimized for unstructured data, using developer-friendly paradigms like the Python Boto API. FILE_SYSTEM_OPTIMIZED bucket ("FSO").
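As a hedged sketch of that developer-facing paradigm, the standard boto3 client can talk to any S3-compatible object store (Ozone, for instance, ships an S3 gateway); the endpoint URL, credentials, bucket, and key below are all placeholders.

```python
import boto3

# Point the standard S3 client at an S3-compatible endpoint (placeholder URL).
s3 = boto3.client(
    "s3",
    endpoint_url="http://ozone-s3g.example.com:9878",
    aws_access_key_id="testuser",
    aws_secret_access_key="testsecret",
)

# Store and fetch an unstructured blob exactly as you would on AWS S3.
s3.put_object(Bucket="ml-artifacts", Key="models/embeddings.bin", Body=b"\x00" * 16)
obj = s3.get_object(Bucket="ml-artifacts", Key="models/embeddings.bin")
print(len(obj["Body"].read()), "bytes")
```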
Open source data lakehouse deployments are built on the foundations of compute engines (like Apache Spark, Trino, Apache Flink), distributed storage (HDFS, cloud blob stores), and metadata catalogs / table formats (like Apache Iceberg, Delta, Hudi, Apache Hive Metastore). Tables are governed according to agreed-upon company standards.
Grab’s Metasense, Uber’s DataK9, and Meta’s classification systems use AI to automatically categorize vast data sets, reducing manual effort and improving accuracy. Beyond classification, organizations now use AI for automated metadata generation and data lineage tracking, creating more intelligent data infrastructures.
Formats: this is a huge part of data engineering. Picking the right format for your data storage matters; the wrong format often means bad query performance and a poor user experience. Read technical blogs, watch conference talks, and read 📘 Designing Data-Intensive Applications (even if it could be overkill).
When Glue receives a trigger, it collects the data, transforms it using code that Glue generates automatically, and then loads it into Amazon S3 or Amazon Redshift. Then, Glue writes the job's metadata into the embedded AWS Glue Data Catalog (during crawling, a classifier returns a certainty score, where 1.0 means the data exactly matches the classifier and 0.0 means it does not). Why use AWS Glue?
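A hedged sketch of reading that catalog metadata back with boto3; the region, database, and table names are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Fetch a table definition that a Glue job or crawler registered in the catalog.
table = glue.get_table(DatabaseName="sales_db", Name="orders")["Table"]
print(table["StorageDescriptor"]["Location"])       # where the data lives
for col in table["StorageDescriptor"]["Columns"]:   # inferred schema
    print(col["Name"], col["Type"])
```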
With CDW, as an integrated service of CDP, your line of business gets immediate resources needed for faster application launches and expedited data access, all while protecting the company’s multi-year investment in centralized data management, security, and governance.
A fundamental requirement for any lasting data system is that it should scale along with the growth of the business applications it wishes to serve. NMDB is built to be a highly scalable, multi-tenant, media metadata system that can serve a high volume of write/read throughput as well as support near real-time queries.
Security and governance policies are set once and applied across all data and workloads. Atlas provides open metadata management and governance capabilities to build a catalog of all assets, and also classify and govern these assets. Build and run the applications. Apache HBase.
Distributed tracing: the missing context in troubleshooting services at scale. Prior to Edgar, our engineers had to sift through a mountain of metadata and logs pulled from various Netflix microservices in order to understand a specific streaming failure experienced by any of our members.
Storage: Snowflake. Snowflake, a cloud-based data warehouse tailored for analytical needs, will serve as our data storage solution. The data volume we will deal with is small, so we will not overkill with data partitioning, time travel, Snowpark, and other advanced Snowflake capabilities.
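A minimal sketch using the snowflake-connector-python package; the account identifier, credentials, and object names are placeholders.

```python
import snowflake.connector

# Open a connection to the warehouse/database/schema we loaded data into.
conn = snowflake.connector.connect(
    account="xy12345.eu-west-1",
    user="LOADER",
    password="...",
    warehouse="COMPUTE_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM raw_events")  # sanity-check the load
    print(cur.fetchone()[0])
finally:
    conn.close()
```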
Today’s cloud systems excel at high-volume data storage, powerful analytics, AI, and software and systems development. You must carefully consider various mainframe functions, including security, system logs, metadata, and COBOL copybooks, when moving to the new cloud platform.
While a business analyst may wonder why the values in their customer satisfaction dashboard have not changed since yesterday, a DBA may want to know why one of today’s queries took so long, and a system administrator needs to find out why data storage is skewed to a few nodes in the cluster.
This blog will guide you through the best data modeling methodologies and processes for your data lake, helping you make informed decisions and optimize your data management practices. What is a Data Lake? What are Data Modeling Methodologies, and Why Are They Important for a Data Lake?
While this “data tsunami” may pose a new set of challenges, it also opens up opportunities for a wide variety of high-value business intelligence (BI) and other analytics use cases that most companies are eager to deploy. Traditional data warehouse vendors may have maturity in data storage, modeling, and high-performance analysis.
With FSO, Apache Ozone guarantees atomic directory operations, and renaming or deleting a directory is a simple metadata operation even if the directory has a large set of sub-paths (directories/files) within it. In fact, this gives Apache Ozone a significant performance advantage over other object stores in the data analytics ecosystem.
Unity Catalog is Databricks’ governance solution; it integrates with Databricks workspaces and provides a centralized platform for managing metadata, data access, and security. Improved data discovery: the tagging and documentation features in Unity Catalog facilitate better data discovery.
The APIs support emitting unstructured log lines and typed metadata key-value pairs (per line). The extracted key-value pairs are written to the line’s metadata. Query clusters support interactive and bulk queries on one or more log streams with predicate filters on log text and metadata.
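The excerpt doesn't show the API itself, so here is a hypothetical Python sketch of the shape it describes: each emitted record carries one unstructured text line plus typed key-value metadata that queries can filter on.

```python
import json
import time

def emit(stream: str, line: str, **metadata) -> str:
    """Emit one log record: free-form text plus per-line typed metadata."""
    record = {
        "stream": stream,
        "ts_ms": int(time.time() * 1000),
        "line": line,          # unstructured log text
        "metadata": metadata,  # typed key-value pairs (per line)
    }
    return json.dumps(record)

# A query layer could then filter on metadata predicates like region or retriable.
print(emit("playback", "startup failed for title",
           title_id=8861, region="us-east-1", retriable=True))
```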
At its core, a table format is a sophisticated metadata layer that defines, organizes, and interprets multiple underlying data files. Table formats incorporate aspects like columns, rows, data types, and relationships, but can also include information about the structure of the data itself.
This architecture format consists of several key layers that are essential to helping an organization run fast analytics on structured and unstructured data. Table of contents: what is data lakehouse architecture? The 5 key layers of data lakehouse architecture, including the storage layer, metadata layer, and API layer.
It’s designed to improve upon the performance and usability challenges of older data storage formats such as Apache Hive and Apache Parquet. For example, Monte Carlo can monitor Apache Iceberg tables for data quality incidents, where other data observability platforms may be more limited.
There are also several changes in KRaft (namely "Revise KRaft Metadata Records" and "Producer ID generation in KRaft mode"), along with many other changes. Unfortunately, the feature that was most awaited (at least by me), tiered storage, has been postponed to a subsequent release. Support for Scala 2.12 … And more files means more time.
That’s why it’s essential for teams to choose the right architecture for the storage layer of their data stack. But the options for data storage are evolving quickly. So let’s get to the bottom of the big question: what kind of data storage layer will provide the strongest foundation for your data platform?
Parquet vs ORC vs Avro vs Delta Lake. Photo by Viktor Talashuk on Unsplash. The big data world is full of various storage systems, heavily influenced by different file formats. These are key in nearly all data pipelines, allowing for efficient data storage and easier querying and information extraction.
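As a small example of why columnar formats dominate these pipelines, the pyarrow sketch below writes a Parquet file and reads back only the columns a query needs (the file name, column names, and data are arbitrary).

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table and persist it as compressed Parquet.
table = pa.table({
    "user_id": [1, 2, 3],
    "amount": [9.99, 14.50, 3.25],
    "country": ["DE", "US", "FR"],
})
pq.write_table(table, "payments.parquet", compression="zstd")

# Columnar layout lets readers pull only the columns they need.
subset = pq.read_table("payments.parquet", columns=["country", "amount"])
print(subset.to_pydict())
```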
In 2010, a transformative concept took root in the realm of data storage and analytics: the data lake. The term was coined by James Dixon, a back-end Java, data, and business intelligence engineer, and it started a new era in how organizations could store, manage, and analyze their data.