The startup was able to begin operations thanks to an EU grant from the NGI Search program. Results are stored in git and in their database, together with benchmarking metadata. Benchmarking results for each instance type are stored in the sc-inspector-data repo, along with the benchmarking task hash and other metadata.
The key to those solutions is a robust and flexible metadata management system. LinkedIn has gone through several iterations on the most maintainable and scalable approach to metadata, leading them to their current work on DataHub. What were you using at LinkedIn for metadata management prior to the introduction of DataHub?
This scenario underscored the need for a new recommender system architecture where member preference learning is centralized, enhancing accessibility and utility across different models. To harness this data effectively, we employ a process of interaction tokenization, ensuring meaningful events are identified and redundancies are minimized.
Although it is the simplest way to subscribe to and access events from Kafka, behind the scenes Kafka consumers handle tricky distributed systems challenges like data consistency, failover, and load balancing. There is no way a single computer node will ever be able to ingest and process all the events that get generated in real time.
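The load-balancing behavior described above can be sketched with a toy partition-assignment model. This is a simplification with illustrative names; real Kafka consumer groups use a group coordinator and a rebalance protocol rather than this direct function call.

```python
# Toy model of Kafka consumer-group load balancing: the partitions of a topic
# are divided among the consumers in a group, and reassigned on failover.
# Illustrative only; real Kafka uses a group coordinator and rebalance protocol.

def assign_partitions(partitions, consumers):
    """Round-robin partition assignment, similar in spirit to Kafka's
    round-robin assignor."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(6))            # a topic with 6 partitions
group = ["consumer-a", "consumer-b", "consumer-c"]

# Each consumer ingests only a subset of the stream, so no single node
# has to process every event.
before = assign_partitions(partitions, group)
print(before)  # {'consumer-a': [0, 3], 'consumer-b': [1, 4], 'consumer-c': [2, 5]}

# Failover: if consumer-c dies, its partitions are rebalanced to survivors.
after = assign_partitions(partitions, ["consumer-a", "consumer-b"])
print(after)   # {'consumer-a': [0, 2, 4], 'consumer-b': [1, 3, 5]}
```

The point of the sketch is that consumers stay oblivious to the mechanics: the group abstraction decides who reads what, both in steady state and after a failure.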
This ecosystem includes: Catalogs: services that manage metadata about Iceberg tables (e.g., …). Maintenance Processes: operations that optimize Iceberg tables, such as compacting small files and managing metadata. Metadata Overhead: Iceberg relies heavily on metadata to track table changes and enable features like time travel.
Event-first thinking enables us to build a new atomic unit: the event. Four pillars of event streaming. Pillar 4 – Operational plane: Event logging, DLQs and automation. To read the other articles in this series, see: Journey to Event Driven – Part 1: Why Event-First Thinking Changes Everything.
Iceberg tables become interoperable while maintaining ACID compliance by adding a layer of metadata to the data files in a user's object storage. An external catalog tracks the latest table metadata and helps ensure consistency across multiple readers and writers. Put simply: Iceberg is metadata.
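The "Iceberg is metadata" idea can be sketched in a few lines: data files are immutable, each table version is just a metadata record listing them, and the catalog holds a single pointer that is swapped atomically. This is a toy model with made-up names; real Iceberg uses manifest lists, manifest files, and catalog-level compare-and-swap.

```python
# Minimal sketch of the idea that "the table is metadata": data files are
# immutable; each commit writes a new metadata record listing them, and a
# catalog holds one pointer per table to the latest metadata.
# Illustrative only; real Iceberg uses manifests and catalog CAS operations.

class Catalog:
    def __init__(self):
        self.latest = {}  # table name -> current metadata record

    def commit(self, table, expected_version, new_metadata):
        """Atomically swap the metadata pointer; fail if another writer
        committed first. This keeps concurrent readers/writers consistent."""
        current = self.latest.get(table)
        if current is not None and current["version"] != expected_version:
            raise RuntimeError("concurrent commit detected; retry on new base")
        self.latest[table] = new_metadata

catalog = Catalog()
v1 = {"version": 1, "data_files": ["s3://bucket/a.parquet"]}
catalog.commit("events", None, v1)

# A writer appends a file by writing *new* metadata, never mutating old files.
v2 = {"version": 2, "data_files": v1["data_files"] + ["s3://bucket/b.parquet"]}
catalog.commit("events", 1, v2)

# Readers always see a consistent snapshot via the catalog pointer.
print(catalog.latest["events"]["data_files"])
```

Because readers only ever follow the catalog pointer, they see either the old snapshot or the new one, never a half-written state; that is where the ACID guarantee in the excerpt comes from.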
However, these tools are limited by their lack of access to runtime data, which can lead to false positives from unexecuted code. Improving the consumption experience: streamline consumption to make it easier for developers and stakeholders to access and utilize data lineage information.
Collecting Raw Impression Events As Netflix members explore our platform, their interactions with the user interface spark a vast array of raw events. These events are promptly relayed from the client side to our servers, entering a centralized event processing queue.
Ingest data more efficiently and manage costs. For data managed by Snowflake, we are introducing features that help you access data easily and cost-effectively. This reduces the overall complexity of getting streaming data ready to use: simply create an external access integration with your existing Kafka solution.
Event Alert: MLOps World / Gen AI World - Austin, TX - Nov 7-8. The Gen AI Summit, drawing a wider group of 20,000 engineers, AI entrepreneurs, and scientists, will host 1,000 AI teams in Austin, TX, November 7-8. Passes include app-brain-date networking, birds-of-a-feather sessions, post-event parties, etc. What are you waiting for?
At every step, we do not just read, transform, and write data; we do the same with the metadata. Every data governance policy on this topic must be readable by code so it can act in your data platform (access management, masking, etc.). Who has access to this data? Finally, the data security and privacy part was added.
Summary: Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. Developing event-driven pipelines is going to be a lot easier - Meet Functions!
During a recent talk titled Hunters ATT&CKing with the Right Data, which I presented with my brother Jose Luis Rodriguez at ATT&CKcon, we talked about the importance of documenting and modeling security event logs before developing any data analytics while preparing for a threat hunting engagement. Yeah…I can do that already!
Kafka is designed for streaming events, but Fluss is designed for streaming analytics. This capability, termed Union Read, allows both layers to work in tandem for highly efficient and accurate data access. It excels in event-driven architectures and data pipelines. How do you compare Fluss with Apache Kafka?
Metadata is the information that provides context and meaning to data, ensuring it’s easily discoverable, organized, and actionable. This is what managing data without metadata feels like. Effective metadata management is no longer a luxury—it’s a necessity.
New content or national events may drive brief spikes, but, by and large, traffic is usually smoothly increasing or decreasing. It also included metadata about ads, such as ad placement and impression-tracking events. We stored these responses in a Keystone stream with outputs for Kafka and Elasticsearch.
At our recent Snowday event, we announced a wave of Snowflake product innovations for easier application development, new AI and LLM capabilities, better cost management and more. If you missed the event or need a refresh of what was presented, watch any Snowday session on demand. Learn more about Iceberg Tables here. Learn more.
Editor’s Note: Data Council 2025, Apr 22-24, Oakland, CA. Data Council has always been one of my favorite events to connect with and learn from the data engineering community. Data Council 2025 is set for April 22-24 in Oakland, CA. Initially a self-service platform (Nuage 1.0), it transitioned to a decentralized model (Nuage 2.0).
Data & Metadata: the data of the data product in as many storages as needed, but also the metadata (data on data). Infrastructure: you will need compute and storage, but with the Serverless philosophy we want to make it totally transparent and stay focused on the first two dimensions. What you have to code is this workflow!
They are used to give Snowflake access to unstructured data files and support both internal and external stages. Querying a directory table retrieves the Snowflake-hosted file URL for each file present in the stage. Directory table metadata should be refreshed automatically when the underlying stage is updated.
Studio applications use this service to store their media assets, which then go through an asset cycle of schema validation, versioning, access control, sharing, and triggering of configured workflows like inspection, proxy generation, etc. This pattern grows over time as we need to access and update the existing assets' metadata.
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Join in with the event for the global data community, Data Council Austin. Don't miss out on their only event this year!
Unlike traditional planners that need to consider accessing a table via a variety of types of index, Impala’s planner always starts with a full table scan and then applies pruning techniques to reduce the data scanned. Metadata Caching. See the performance results below for an example of how metadata caching helps reduce latency.
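The planning style described above can be sketched as a two-step function: start with every partition, then prune using cached metadata alone, with no data reads. The partition layout and predicate here are hypothetical, and Impala's actual planner and metadata cache are far richer.

```python
# Sketch of "start from a full table scan, then prune": the planner begins
# with every partition and drops those a predicate rules out, consulting only
# cached metadata (no disk reads). Illustrative only, not Impala internals.

partitions = [  # partition key -> value, as held in the metadata cache
    {"path": "/tbl/day=2024-01-01", "day": "2024-01-01"},
    {"path": "/tbl/day=2024-01-02", "day": "2024-01-02"},
    {"path": "/tbl/day=2024-01-03", "day": "2024-01-03"},
]

def plan_scan(parts, predicate):
    full_scan = list(parts)                        # always start with everything
    return [p for p in full_scan if predicate(p)]  # prune via metadata only

# WHERE day >= '2024-01-02' prunes the first partition before any I/O happens.
pruned = plan_scan(partitions, lambda p: p["day"] >= "2024-01-02")
print([p["path"] for p in pruned])
```

Because the pruning step touches only cached partition metadata, it is cheap; this is the latency win the metadata-caching excerpt alludes to.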
Try Astro Free → Streamline code deployment, enhance collaboration, and ensure DevOps best practices with Astro's robust CI/CD capabilities.
Here are a couple of the biggest takeaways we had from our time at the event. In those discussions, it was clear that everyone understood the need to treat data estates more cohesively as a whole—that means bringing more attention to security, data governance, and metadata management, the latter of which has become increasingly popular.
Even with detection capabilities, there is a risk that exposed credentials can provide access to sensitive data and/or the ability to cause damage in our environment. Today, we would like to share two additional layers of security: API enforcement and metadata protection.
Let’s discuss how to convert events from an event-driven microservice architecture into relational tables in a warehouse like Snowflake. Quality problems leave first responders unable to check in at disaster sites, or parents unable to access ESA funds. So our solution was to start using an intentional contract: Events.
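An event contract of the kind described can be sketched as a schema that every event must satisfy before it is flattened into a warehouse row. The field names and contract shape here are hypothetical; a real pipeline would validate against a shared schema registry and land the rows in Snowflake.

```python
# Sketch of converting microservice events into relational rows under an
# explicit, intentional contract. Field names are hypothetical; a real
# pipeline would enforce a shared schema and load the rows into a warehouse.

CONTRACT = {"event_type": str, "user_id": str, "occurred_at": str}

def to_row(event):
    """Validate an event against the contract and flatten it into a row;
    unexpected fields are dropped, missing or mistyped fields fail loudly."""
    for field, typ in CONTRACT.items():
        if not isinstance(event.get(field), typ):
            raise ValueError(f"contract violation on field {field!r}")
    return {f: event[f] for f in CONTRACT}

event = {"event_type": "esa_fund_access", "user_id": "u-42",
         "occurred_at": "2024-05-01T12:00:00Z", "extra": "ignored"}
print(to_row(event))
```

Failing loudly at the contract boundary is the point: quality problems surface in the pipeline, not downstream where a first responder or parent hits a broken experience.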
Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. The modern data stack needs a reimagined metadata management platform.
Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. High-growth startups use Molecula’s feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data.
At the same time, organizations must ensure the right people have access to the right content, while also protecting sensitive and/or Personally Identifiable Information (PII) and fulfilling a growing list of regulatory requirements. Additional built-in UIs and privacy enhancements make it even easier to understand and manage sensitive data.
Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, Joey Lynch. Introduction: As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data — often reaching petabytes — with millisecond access latency has become increasingly vital.
Impala Row Filtering to set access policies for rows when reading from a table. The Atlas / Kafka integration provides metadata collection for Kafka producers/consumers so that consumers can manage, govern, and monitor Kafka metadata and metadata lineage in the Atlas UI. Figure 1: sales group SELECT access.
By collecting, accessing, and analyzing network data from a variety of sources like VPC Flow Logs, ELB Access Logs, eBPF flow logs on the instances, etc., we can provide network insight to users and central teams through multiple data visualization techniques like Lumen, Atlas, etc.
Second, developers had to constantly re-learn new data modeling practices and common yet critical data access patterns. This abstraction simplifies data access, enhances the reliability of our infrastructure, and enables us to support the broad spectrum of use cases that Netflix demands with minimal developer effort.
Input: List of source tables and required processing mode. Output: Psyberg identifies new events that have occurred since the last high watermark (HWM) and records them in the session metadata table. The session metadata table can then be read to determine the pipeline input. Audit: Run various quality checks on the staged data.
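The high-watermark step described above can be sketched in a few lines: select only events past the last HWM, then record the new window in a session-metadata record for downstream reads. Names and the record shape are illustrative, not Psyberg's actual API.

```python
# Sketch of high-watermark (HWM) incremental processing: pick up only events
# newer than the last HWM and record the new window in session metadata.
# Illustrative only; field names are not Psyberg's actual schema.

def detect_new_events(events, last_hwm):
    """Return events past the high watermark plus updated session metadata."""
    new = [e for e in events if e["ts"] > last_hwm]
    new_hwm = max((e["ts"] for e in new), default=last_hwm)
    session_metadata = {"prev_hwm": last_hwm, "hwm": new_hwm,
                        "row_count": len(new)}
    return new, session_metadata

events = [{"ts": 100, "id": "a"}, {"ts": 200, "id": "b"}, {"ts": 300, "id": "c"}]
new, meta = detect_new_events(events, last_hwm=150)
print(meta)  # {'prev_hwm': 150, 'hwm': 300, 'row_count': 2}
```

Persisting both the previous and new watermark in the session record is what lets a later pipeline stage (or an audit) reconstruct exactly which window each run processed.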
This is particularly useful in environments where multiple applications need to access and process the same data. This configuration ensures that if the host goes down due to an EC2® event or any other reason, it will be automatically reprovisioned.
Getting started with version control APIs To begin, ensure that the following prerequisites are met: You have admin access (can administer Thoughtspot privilege) to connect ThoughtSpot to a Git repository and deploy commits. You have a Git repository and a branch that can be used as a default branch in ThoughtSpot.
This strategy must incorporate business impact analyses as well as backup and recovery plans in the event of a security incident or loss of access to data. Access Control: Snowflake allows customers to define granular permissions for user roles, minimizing the risk of unauthorized access to sensitive data.
NMDB is built to be a highly scalable, multi-tenant media metadata system that can serve a high volume of write/read throughput as well as support near real-time queries, under varying load conditions as well as a wide variety of access patterns; (b) scalability: persisting
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. And don’t forget to thank them for their continued support of this show!
In retrospect, complex SCD modeling techniques are not intuitive and reduce accessibility. Data engineers are also the “librarians” of the data warehouse, cataloging and organizing metadata and defining the processes by which one files or extracts data from the warehouse.
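To make the SCD technique the excerpt calls unintuitive concrete, here is a minimal Slowly Changing Dimension Type 2 sketch: instead of overwriting a dimension row, the current version is closed out and a new version appended. Column names are illustrative, and real implementations usually do this as a warehouse MERGE.

```python
# Minimal SCD Type 2 sketch: never overwrite a dimension row; close the
# current version (valid_to, is_current) and append a new one, preserving
# history. Column names are illustrative.

def scd2_apply(dim_rows, key, new_attrs, as_of):
    """Close the current row for `key` (if its attributes changed) and
    append a new version effective from `as_of`."""
    for row in dim_rows:
        if row["key"] == key and row["is_current"]:
            if row["attrs"] == new_attrs:
                return dim_rows        # no change, nothing to do
            row["is_current"] = False  # close out the old version
            row["valid_to"] = as_of
    dim_rows.append({"key": key, "attrs": new_attrs,
                     "valid_from": as_of, "valid_to": None,
                     "is_current": True})
    return dim_rows

dim = []
scd2_apply(dim, "cust-1", {"city": "Oslo"}, as_of="2024-01-01")
scd2_apply(dim, "cust-1", {"city": "Bergen"}, as_of="2024-06-01")
print(len(dim), dim[-1]["attrs"])  # 2 versions; the latest city is Bergen
```

The version chain is exactly the part that trips people up: every fact-table join now has to pick the row whose validity window contains the fact's date, which is the accessibility cost the excerpt is pointing at.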
the event streaming platform built by the original creators of Apache Kafka. What do we mean by contextual event-driven applications? Well, infrastructure to support event-driven applications has been around for decades; mere messaging is nothing new. Accelerate the development of contextual event-driven applications.