Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew. The data warehouse solved for performance and scale but, much like the databases that preceded it, relied on proprietary formats to build vertically integrated systems.
The trend toward centralizing data will accelerate, with an emphasis on keeping data high quality, accurate, and well managed. Above all, data must be easily accessible to AI systems, with clear metadata management and a focus on relevance and timeliness.
Meanwhile, operations teams use entity extraction on documents to automate workflows and enable metadata-driven analytical filtering. Entity extraction: extracting key entities (names, dates, locations, financial figures) from contracts, invoices or medical records to transform unstructured text into structured data.
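To make that concrete, here is a minimal entity-extraction sketch using spaCy; it assumes the library and its small English model are installed, and the sample text is invented.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Acme Corp signed the contract on March 15, 2024 in Berlin for $1.2 million."
doc = nlp(text)

# Turn unstructured text into structured (entity, label) records
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)  # e.g., [('Acme Corp', 'ORG'), ('March 15, 2024', 'DATE'), ...]
```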
Data Silos: Breaking down barriers between data sources. Hadoop achieved this through distributed processing and storage, using a framework called MapReduce and the Hadoop Distributed File System (HDFS). Today's ecosystem adds catalogs: services that manage metadata about Iceberg tables (e.g., Amazon S3 Tables).
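As a rough illustration of the MapReduce model the excerpt mentions, here is a single-process toy in plain Python (not Hadoop itself):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map step: emit a (word, 1) pair for every word in a document
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Shuffle + reduce step: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big deal", "data lakes hold big data"]
print(reduce_phase(chain.from_iterable(map_phase(d) for d in docs)))
# {'big': 3, 'data': 3, 'deal': 1, 'lakes': 1, 'hold': 1}
```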
AI agents, autonomous systems that perform tasks using AI, can enhance business productivity by handling complex, multi-step operations in minutes. To be effective and reliable, agents need access to an organization's ever-growing data, both unstructured (e.g., text, audio) and structured.
Today’s platform owners, business owners, data developers, analysts, and engineers create new apps on the Cloudera Data Platform and they must decide where and how to store that data. Structured data (such as name, date, ID, and so on) will be stored in regular SQL tables queried through engines like Hive or Impala.
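A sketch of what that looks like in practice, assuming a Spark session with Hive support; the table and column names are made up:

```python
from pyspark.sql import SparkSession

# Assumes Spark is installed and configured with Hive support.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical table: structured fields stored as a regular SQL table
spark.sql("""
    CREATE TABLE IF NOT EXISTS customers (
        name        STRING,
        signup_date DATE,
        id          BIGINT
    ) STORED AS PARQUET
""")
spark.sql("SELECT name, id FROM customers WHERE signup_date >= '2024-01-01'").show()
```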
Kirk Marple has spent years working with data systems and the media industry, which inspired him to build a platform for automatically organizing your unstructured assets to make them more valuable. The data you’re looking for is already in your data warehouse and BI tools. No more scripts, just SQL.
Learn practical strategies to optimize Airflow performance and streamline operations:
- Fine-tune configurations to enhance workflow efficiency
- Automate Airflow deployments and manage users seamlessly
- Monitor system health with advanced observability tools and alerts
Join this live session and learn how to scale Airflow efficiently.
To give customers flexibility for how they fit Snowflake into their architecture, Iceberg Tables can be configured to use either Snowflake or an external service like AWS Glue as the table's catalog to track metadata, with an easy one-line SQL command to convert to Snowflake in a metadata-only operation.
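For the external-catalog side, here is a sketch of pointing Spark at AWS Glue as an Iceberg catalog; the config keys follow Iceberg's documentation, while the catalog name, bucket, and table are placeholders (the Snowflake conversion command itself is not reproduced here):

```python
from pyspark.sql import SparkSession

# Requires the Iceberg Spark runtime and AWS bundle on the classpath.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Query an Iceberg table whose metadata is tracked in Glue
spark.sql("SELECT * FROM glue.analytics.events LIMIT 10").show()
```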
You’ll learn about the types of recommender systems, their differences, strengths, weaknesses, and real-life examples. Personalization and recommender systems in a nutshell: primarily developed to help users deal with the large range of choices they encounter, recommender systems power many familiar services (e.g., Amazon, Booking.com).
Systems and application logs play a key role in operations, observability, and debugging workflows at Meta. We designed the system to support service-level guarantees on log freshness, completeness, durability, query latency, and query result completeness. Each log line can have zero or more metadata key-value pairs attached to it.
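A toy rendering of the "zero or more key-value pairs" idea, not Meta's actual wire format:

```python
def format_log_line(message, **metadata):
    # Attach zero or more key-value pairs to a log line
    kv = " ".join(f"{k}={v}" for k, v in sorted(metadata.items()))
    return f"{message} | {kv}" if kv else message

print(format_log_line("request finished"))
print(format_log_line("request finished", dur_ms=42, region="eu", status=200))
# request finished | dur_ms=42 region=eu status=200
```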
Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. When doing data collection from various sources, how do you ensure that intellectual property rights are respected?
We live in a hybrid data world. In the past decade, the amount of structured data created, captured, copied, and consumed globally has grown from less than 1 ZB in 2011 to nearly 14 ZB in 2020. Impressive, but dwarfed by the amount of unstructured data, cloud data, and machine data – another 50 ZB.
You don’t need to archive or clean data before loading. The system automatically replicates information to prevent data loss in the case of a node failure. To understand how the entire mechanism works, we need to get familiar with Hadoop structure and key parts. A file stored in the system is, in this sense, fail-safe.
DotSlash transparently handles fetching, decompressing, and verifying the appropriate remote artifact for the current operating system and CPU. Our continuous integration (CI) system supports special configuration for DotSlash jobs where a user must specify: a set of builds to run (these can span multiple platforms).
Open Context is an open access data publishing service for archaeology. It started because we need better ways of disseminating structured data and digital media than is possible with conventional articles, books and reports. What are your protocols for determining which data sets you will work with?
In the previous blog posts in this series, we introduced the Netflix Media Data Base (NMDB) and its salient “Media Document” data model. In this post we will provide details of the NMDB system architecture, beginning with the system requirements. (These key-value stores generally allow storing any data under a key.)
Open source data lakehouse deployments are built on the foundations of compute engines (like Apache Spark, Trino, Apache Flink), distributed storage (HDFS, cloud blob stores), and metadata catalogs / table formats (like Apache Iceberg, Delta, Hudi, Apache Hive Metastore). Tables are governed according to agreed-upon company standards.
This operational component places some cognitive load on our engineers, requiring them to develop deep understanding of telemetry and alerting systems, capacity provisioning process, security and reliability best practices, and a vast amount of informal knowledge about the cloud infrastructure.
For this reason, a new data management framework for ML has emerged to help manage this complexity: the “feature store.” As described in Tecton’s blog, a feature store is a data management system for managing ML feature pipelines, including the management of feature engineering code and data.
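A toy in-memory version of that interface, to show the moving parts; this is a sketch, not Tecton's or any vendor's actual API:

```python
class FeatureStore:
    """Toy feature store: registers feature-engineering code and
    serves materialized feature values per entity."""

    def __init__(self):
        self._transforms = {}  # feature name -> engineering function
        self._values = {}      # (feature name, entity id) -> value

    def register(self, name, transform):
        self._transforms[name] = transform

    def materialize(self, name, entity_id, raw_record):
        self._values[(name, entity_id)] = self._transforms[name](raw_record)

    def get(self, name, entity_id):
        return self._values[(name, entity_id)]

store = FeatureStore()
store.register("order_total_usd", lambda rec: sum(rec["order_amounts"]))
store.materialize("order_total_usd", "user_42", {"order_amounts": [10.0, 25.5]})
print(store.get("order_total_usd", "user_42"))  # 35.5
```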
By enabling their event analysts to monitor and analyze events in real time, as well as directly in their data visualization tool, and also rate and give feedback to the system interactively, they increased their data-to-insight productivity by a factor of 10.
Cyber defenders struggle with: Too much data: Cybersecurity tools generate an overwhelming volume of log data, including Domain Name System (DNS) records, firewall logs, and more. All of this data is essential for investigations and threat hunting, but existing systems often struggle to manage it efficiently.
This blog will guide you through the best data modeling methodologies and processes for your data lake, helping you make informed decisions and optimize your data management practices. What is a Data Lake? What are Data Modeling Methodologies, and Why Are They Important for a Data Lake?
The data warehouse is not designed to serve point requests from microservices with low latency. Therefore, we must efficiently move data from the data warehouse to a global, low-latency and highly reliable key-value store. When a table changes (e.g., a truncate or drop), this allows us to cheaply recycle old versions of the data.
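One common way to make that recycling cheap is to namespace keys by dataset version and swap an "active version" pointer; the dict-based sketch below illustrates the pattern under that assumption (it is not the actual implementation):

```python
class VersionedKV:
    """Keys live under a version prefix; readers follow the active
    pointer, so dropping an old version is a bulk prefix delete."""

    def __init__(self):
        self._data = {}
        self.active_version = 0

    def load_version(self, version, rows):
        for key, value in rows.items():
            self._data[(version, key)] = value

    def activate(self, version):
        self.active_version = version  # effectively an atomic pointer swap

    def get(self, key):
        return self._data[(self.active_version, key)]

    def recycle(self, version):
        # Cheap "truncate": drop every key under a retired version
        self._data = {k: v for k, v in self._data.items() if k[0] != version}

kv = VersionedKV()
kv.load_version(1, {"title:1": "some value"})
kv.activate(1)
print(kv.get("title:1"))
```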
In a nutshell, the lakehouse system leverages low-cost storage to keep large volumes of data in its raw formats just like data lakes. At the same time, it brings structure to data and empowers data management features similar to those in data warehouses by implementing the metadata layer on top of the store.
A data fabric is an architecture design presented as an integration and orchestration layer built on top of multiple disjointed data sources like relational databases , data warehouses , data lakes, data marts , IoT , legacy systems, etc., to provide a unified view of all enterprise data.
The Media Timeline Data Model: In the previous post in this series, we described some important Netflix business needs as well as traits of the media data system called NMDB. The curious reader might have noticed that a majority of these characteristics relate to properties of the data managed by NMDB.
And crucially, what does the future hold for data engineering in an AI-driven world? While data engineering and Artificial Intelligence (AI) may seem like distinct fields at first glance, their symbiosis is undeniable. The foundation of any AI system is high-quality data.
Our Incident Playbooks cover emergency procedures to initiate in case a certain set of conditions is met, for example when one of our systems is overloaded and the existing resiliency measures (e.g., around the processing of price updates) are not enough. When the bigger system context is considered, there are more options available to mitigate issues.
The Data Lake architecture was proposed in a period of great growth in data volume, especially in non-structured and semi-structured data, when traditional Data Warehouse systems started to become incapable of dealing with this demand. The data became useless. delta_table.history().select("version", "timestamp", "operation")
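A slightly fuller sketch of the Delta Lake calls the excerpt references, with a placeholder table path; it assumes Delta Lake is available to the Spark session:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # assumes Delta on the classpath
path = "/tmp/events"                        # placeholder table location

# Inspect the transaction log: one row per table version
delta_table = DeltaTable.forPath(spark, path)
delta_table.history().select("version", "timestamp", "operation").show()

# Time travel: read the table as it was at an earlier version
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```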
The storage system uses Capacitor, a proprietary columnar storage format by Google for semi-structured data, and the file system underneath is Colossus, the distributed file system by Google. BigQuery maintains a lot of valuable metadata about tables, columns and partitions.
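One way to inspect some of that metadata from Python is through BigQuery's INFORMATION_SCHEMA views; the project and dataset names below are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

# Column-level metadata for every table in a (placeholder) dataset
sql = """
    SELECT table_name, column_name, data_type
    FROM `my-project.my_dataset.INFORMATION_SCHEMA.COLUMNS`
    ORDER BY table_name, ordinal_position
"""
for row in client.query(sql).result():
    print(row.table_name, row.column_name, row.data_type)
```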
Data integration with ETL has evolved from structured data stores with high computing costs to storing data in its natural state and altering it at read time, thanks to the agility of the cloud. Data integration with ETL has changed in the last three decades. AWS Glue has a central metadata repository called the Glue catalog.
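A quick sketch of browsing that central repository with boto3; the database name is a placeholder, AWS credentials are assumed, and pagination is omitted for brevity:

```python
import boto3

glue = boto3.client("glue")

# List tables registered in a (placeholder) Glue catalog database
for table in glue.get_tables(DatabaseName="analytics")["TableList"]:
    columns = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], columns)
```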
Despite these limitations, data warehouses, introduced in the late 1980s based on ideas developed even earlier, remain in widespread use today for certain business intelligence and data analysis applications. They are limited in their use cases, however, as they only support structured data.
Snapshot testing augments debugging capabilities by recording past table states, facilitating the identification of unforeseen spikes, declines, or abnormalities before they affect production systems. Data freshness propagation: no automatic tracking of data propagation delays across multiple models.
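A minimal sketch of the idea: record a table's stats and compare new runs against the last recorded state. The names and the 20% tolerance are invented for illustration:

```python
import json
import pathlib

def snapshot_test(table_name, current_stats, snapshot_dir="snapshots"):
    path = pathlib.Path(snapshot_dir) / f"{table_name}.json"
    if path.exists():
        previous = json.loads(path.read_text())
        for metric, old in previous.items():
            new = current_stats.get(metric)
            # Flag spikes or declines beyond a crude 20% tolerance
            if new is not None and old and abs(new - old) / old > 0.20:
                print(f"{table_name}.{metric}: {old} -> {new} (check before prod)")
    path.parent.mkdir(exist_ok=True)
    path.write_text(json.dumps(current_stats))

snapshot_test("orders", {"row_count": 1500, "null_ids": 0})
```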
Data creation is often the differentiator between the success and failure of a data team. A business process or workflow engine is a software system that enables businesses to execute well-defined steps to complete a user’s intention. Common patterns for reliably creating data events: Event Sourcing, Change Data Capture (CDC), and the Outbox pattern.
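To ground the Outbox pattern specifically, a self-contained sketch with SQLite: the business row and its event row commit in one transaction, and a separate relay would ship unpublished events downstream. Table and event names are made up:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event_type TEXT, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

with db:  # one atomic transaction for the write and its event
    cur = db.execute("INSERT INTO orders (total) VALUES (?)", (99.5,))
    db.execute(
        "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
        ("OrderCreated", json.dumps({"order_id": cur.lastrowid, "total": 99.5})),
    )

# A relay process would poll these rows and mark them published
print(db.execute("SELECT event_type, payload FROM outbox WHERE published = 0").fetchall())
```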
Instead of relying on traditional hierarchical structures and predefined schemas, as in the case of data warehouses, a data lake utilizes a flat architecture. This structure is made efficient by data engineering practices that include object storage. Watch our video explaining how data engineering works.
Data Lake vs Data Warehouse - The Differences. Before we closely analyse some of the key differences between a data lake and a data warehouse, it is important to have an in-depth understanding of what a data warehouse and a data lake are. Data Lake vs Data Warehouse - The Introduction. What is a Data Warehouse?
Here’s the final architecture: I’ve been doing some flavour of systems integration for the past 15 years, and usually I finish a project and think “it shouldn’t have taken that much effort”. For those unfamiliar, DynamoDB makes database scalability a breeze, but with some major caveats.
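To make those caveats concrete, a small boto3 sketch; the table, key schema, and credentials are assumptions:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
orders = dynamodb.Table("orders")  # hypothetical table keyed on order_id

# Key-based access is what DynamoDB scales effortlessly
orders.put_item(Item={"order_id": "o-123", "status": "shipped"})
print(orders.get_item(Key={"order_id": "o-123"}).get("Item"))

# The caveat: filtering on non-key attributes means a full Scan,
# or a secondary index you designed up front.
```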
Parquet vs ORC vs Avro vs Delta Lake Photo by Viktor Talashuk on Unsplash The big data world is full of various storage systems, heavily influenced by different file formats. These are key in nearly all data pipelines, allowing for efficient data storage and easier querying and information extraction.
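As a small taste of why columnar formats ease querying and extraction, a PyArrow sketch writing and selectively reading Parquet (the file name is arbitrary):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Columnar layout lets readers pull only the columns they need
table = pa.table({"user_id": [1, 2, 3], "country": ["DE", "US", "BR"]})
pq.write_table(table, "events.parquet")

# Column pruning on read: only 'country' is deserialized
print(pq.read_table("events.parquet", columns=["country"]).to_pydict())
```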
A Hadoop cluster is a group of computers called nodes that act as a single centralized system working on the same task. A client or edge node serves as a gateway between a Hadoop cluster and outer systems and applications. It loads data into the cluster and grabs the results of the processing, staying outside the master-slave hierarchy.
Key features of Hadoop vs RDBMS, an overview: Hadoop is an open-source software collection that links several computers to solve problems requiring large quantities of data and processing. An RDBMS is system software used to create and manage databases based on the relational model. An RDBMS stores structured data.
Hive vs Pig - Performance Benchmarking and Differences

| Pig | Hive |
|---|---|
| Procedural data flow language | Declarative SQL-like language |
| For programming | For creating reports |
| Mainly used by researchers and programmers | Mainly used by data analysts |
| Operates on the client side of a cluster | Operates on the server side of a cluster |
| Does not have a dedicated metadata database | Keeps table metadata in a dedicated metastore |
The larger the company, the more data it has to generate actionable insights. But because that data is scattered across disparate systems, it is hardly available to analytical apps. Evidently, common storage solutions fail to provide a unified data view and meet companies’ needs for seamless data flow. Data lake vs data hub.
Artificial Intelligence: AI is a broad term used to describe engineered systems that have been taught to do a task that typically requires human intelligence. BI (Business Intelligence): Strategies and systems used by enterprises to conduct data analysis and make pertinent business decisions.