Summary: Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. To level up its value, a new trend of active metadata is being implemented, enabling use cases like keeping BI reports up to date, auto-scaling your warehouses, and automating data governance.
This ecosystem includes: Catalogs: services that manage metadata about Iceberg tables; Compute Engines: tools that query and process data stored in Iceberg tables (e.g., Trino, Spark, Snowflake, DuckDB); Maintenance Processes: operations that optimize Iceberg tables, such as compacting small files and managing metadata.
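For instance, here is a minimal sketch of one ecosystem path: DuckDB as the compute engine reading an Iceberg table straight from its warehouse location (the table path is a hypothetical placeholder).

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")  # one-time extension install
con.execute("LOAD iceberg")

# iceberg_scan reads an Iceberg table directly from its warehouse location;
# the path below is illustrative only
count = con.execute(
    "SELECT count(*) FROM iceberg_scan('data/warehouse/db/events')"
).fetchone()[0]
print(count)
```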
Snowpipe for Apache Kafka (public preview coming soon on AWS and Microsoft Azure) uses a “pull” mechanism rather than the existing “push” connector, letting you extract and ingest Apache Kafka events into your Snowflake account directly, without hosting your own Kafka Connect cluster.
Editor’s Note: Data Council 2025, Apr 22-24, Oakland, CA. Data Council has always been one of my favorite events to connect with and learn from the data engineering community. Data Council 2025 is set for April 22-24 in Oakland, CA.
Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Sign up free at dataengineeringpodcast.com/rudder. Data teams are increasingly under pressure to deliver. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months.
During a recent talk titled Hunters ATT&CKing with the Right Data, which I presented with my brother Jose Luis Rodriguez at ATT&CKcon, we talked about the importance of documenting and modeling security event logs before developing any data analytics while preparing for a threat hunting engagement. Why KSQL and HELK?
DE Zoomcamp 2.2.1 – Introduction to Workflow Orchestration. Following last week's blog, we move to data ingestion. We already had a script that downloaded a CSV file, processed the data, and pushed it to a Postgres database. This week, we got to think about our data ingestion design.
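A minimal sketch of that kind of script, assuming pandas and SQLAlchemy; the source URL, table name, and connection string are hypothetical placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine

CSV_URL = "https://example.com/trips.csv"  # hypothetical source
engine = create_engine("postgresql://user:pass@localhost:5432/ny_taxi")

# Download and lightly process the CSV
df = pd.read_csv(CSV_URL)
df.columns = [c.lower().strip() for c in df.columns]

# Push to Postgres; append so repeated runs accumulate new batches
df.to_sql("trips", engine, if_exists="append", index=False)
```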
WAP [Write-Audit-Publish] Pattern: the WAP pattern follows a three-step process. Write Phase: the write phase results from a data ingestion or data transformation step. In the 'Write' stage, we capture the computed data in a log or a staging area. Event Routers typically don’t alter the payload.
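A minimal sketch of the three phases, assuming a SQLAlchemy engine and hypothetical staging/production table names (this is an illustration, not the article's own code).

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost:5432/warehouse")

def write_audit_publish(batch: pd.DataFrame) -> None:
    # 1. Write: land the computed data in a staging table
    batch.to_sql("orders_staging", engine, if_exists="replace", index=False)

    with engine.begin() as conn:
        # 2. Audit: run a data quality check against the staged rows
        nulls = conn.execute(
            text("SELECT count(*) FROM orders_staging WHERE order_id IS NULL")
        ).scalar()
        if nulls:
            raise ValueError(f"audit failed: {nulls} null order_id rows")

        # 3. Publish: promote the audited data to production in one transaction
        conn.execute(text("DELETE FROM orders"))
        conn.execute(text("INSERT INTO orders SELECT * FROM orders_staging"))
```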
Snowpark Updates: Model management with the Snowpark Model Registry – public preview. Snowpark Model Registry is an integrated solution to register, manage, and use models and their metadata natively in Snowflake. Learn more here.
It also becomes the role of the data engineering team to be a “center of excellence” through the definition of standards, best practices, and certification processes for data objects. In a fast-growing, rapidly evolving, slightly chaotic data ecosystem, metadata management and tooling become a vital component of a modern data platform.
Therefore, the ingestion approach for data lineage is designed to work with many disparate data sources. Our data ingestion approach, in a nutshell, is classified broadly into two buckets: push and pull. We leverage Metacat data, our internal metadata store and service, to enrich lineage data with additional table metadata.
We built AutoOptimize to efficiently and transparently optimize the data and metadata storage layout while maximizing their cost and performance benefits. Sometimes Data Engineers write downstream ETLs on ingested data to optimize the data/metadata layouts to make other ETL processes cheaper and faster.
Faster data ingestion: streaming ingestion pipelines. The DevOps/app dev team wants to know how data flows between such entities and understand the key performance metrics (KPMs) of these entities. Handle late-arriving data: how does my application detect and deal with streaming events that come out of order?
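One common way to handle out-of-order events is an event-time watermark. A minimal sketch in PySpark Structured Streaming, with hypothetical topic and column names (assumes the Spark-Kafka connector package is available).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("late-events").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")  # hypothetical topic
    .load()
    .selectExpr("CAST(value AS STRING) AS value", "timestamp AS event_time")
)

# Accept events up to 10 minutes late; older stragglers are dropped from
# aggregation state rather than silently corrupting closed windows
counts = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"))
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
```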
A fundamental requirement for any lasting data system is that it should scale along with the growth of the business applications it wishes to serve. NMDB is built to be a highly scalable, multi-tenant, media metadata system that can serve a high volume of write/read throughput as well as support near real-time queries.
Application programming interfaces (APIs) are used to modify the retrieved data set for integration and to help users keep track of all the jobs. Users can schedule ETL jobs, and they can also choose the events that will trigger them. Then, Glue writes the job's metadata into the embedded AWS Glue Data Catalog.
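A minimal sketch of scheduling an existing Glue job with boto3; the job name, trigger name, and cron expression are hypothetical.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Run the (pre-existing) job "nightly-etl" every day at 02:00 UTC
glue.create_trigger(
    Name="nightly-etl-schedule",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "nightly-etl"}],
    StartOnCreation=True,
)
```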
The first blog introduced a mock connected vehicle manufacturing company, The Electric Car Company (ECC), to illustrate the manufacturing data path through the data lifecycle. Having completed the Data Collection step in the previous blog, ECC’s next step in the data lifecycle is Data Enrichment.
Today’s customers have a growing need for faster end-to-end data ingestion to meet the expected speed of insights and overall business demand. This ‘need for speed’ drives a rethink on building a more modern data warehouse solution, one that balances speed with platform cost management, performance, and reliability.
It assigns unique identifiers to each data item—referred to as ‘payloads’—related to each event. By offering real-time tracking mechanisms and sending targeted alerts to specific consumers, a Payload DJ can immediately notify them of any changes, delays, or issues affecting their data.
A true enterprise-grade integration solution calls for source and target connectors that can accommodate: VSAM files, COBOL copybooks, open standards like JSON, and modern platforms like Amazon Web Services (AWS), Confluent, Databricks, or Snowflake. Questions to ask each vendor: Which enterprise data sources and targets do you support?
Less stringent privacy baselines have been proposed in the literature, such as event-level privacy which would consider the privacy of each new view in our setting. Pinot is a columnar OLAP store that serves analytics queries on dataingested from realtime streams.
That’s why, in addition to integrating with your central data warehouse, lake, and lakehouse, Monte Carlo also integrates with transformation, orchestration, and now data ingestion tools. Now teams can instantly get full visibility into how these systems may be impacting their data assets, all in a single pane of glass.
CDW is fully integrated with streaming, data engineering, and machine learning analytics. It has a consistent framework that secures and provides governance for all data and metadata on private clouds, multiple public clouds, or hybrid clouds. Smart DwH Mover helps accelerate data warehouse migration.
Data observability works with your data pipeline by providing insights into how your data flows and is processed from start to end. Here is a more detailed explanation of how data observability works within the data pipeline: Data ingestion: observability begins from the point where data is ingested into the pipeline.
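A minimal sketch of what ingestion-point checks could look like, with illustrative (assumed) thresholds for volume, null rate, and freshness.

```python
from datetime import datetime, timedelta
import pandas as pd

def check_ingested_batch(df: pd.DataFrame, ts_column: str = "event_time"):
    """Return a list of observability findings for a freshly ingested batch."""
    issues = []
    if len(df) == 0:
        issues.append("volume: batch is empty")
        return issues
    # Quality: share of records missing a timestamp
    null_rate = df[ts_column].isna().mean()
    if null_rate > 0.01:
        issues.append(f"quality: {null_rate:.1%} null timestamps")
    # Freshness: newest record should be recent
    newest = pd.to_datetime(df[ts_column]).max()
    if newest < datetime.utcnow() - timedelta(hours=1):
        issues.append(f"freshness: newest record is {newest}")
    return issues  # surface to alerting rather than failing silently
```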
Data Service – a group of Data Flows. At this level, users configure team members, connections to other systems, and event notifications. Data Flow – an individual data pipeline. Data Flows include the ingestion of raw data, transformation via SQL and Python, and sharing of finished data products.
Giannis is a proud alumnus of Rock the JVM, working as a Solutions Architect with a focus on Event Streaming and Stream Processing Systems. Introduction: Apache Kafka is a well-known event streaming platform used in many organizations worldwide. The producer will send some events to Kafka.
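A minimal sketch of a producer sending events to a topic, written in Python with the kafka-python client for consistency with the other examples here (the article itself works on the JVM); the topic name and broker address are hypothetical.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(10):
    # Each record lands on the events topic for downstream consumers
    producer.send("events", {"event_id": i, "type": "page_view"})

producer.flush()  # block until all buffered records are delivered
```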
Apache Hadoop is synonymous with big data for its cost-effectiveness and scalability when processing petabytes of data. Data analysis using Hadoop is just half the battle won; getting data into the Hadoop cluster plays a critical role in any big data deployment. If you are looking for ways to ingest data into Hadoop, then you are on the right page.
Instead of relying on traditional hierarchical structures and predefined schemas, as in the case of data warehouses, a data lake utilizes a flat architecture. This structure is made efficient by data engineering practices that include object storage. Watch our video explaining how data engineering works.
Virtual Reality – The Next Frontier in Media I work as a Data Engineer at a leading company in the VR space, with a mission to capture and transmit reality in perfect fidelity. Our content varies from on-demand experiences to live events like NBA games, comedy shows and music concerts.
In the contemporary data landscape, data teams commonly utilize data warehouses or lakes to arrange their data into L1, L2, and L3 layers. This existing paradigm fails to address the challenges and intricacies of “Data in Use.” For example, these tools may offer metadata-based notifications.
Cross-Cloud Snowgrid: Account Replication expands replication beyond databases – general availability. Account Replication, now generally available, expands replication beyond databases to account metadata and integrations, making business continuity truly turnkey. Visit our documentation page to learn more.
Data Engineering Project for Beginners If you are a newbie in data engineering and are interested in exploring real-world data engineering projects, check out the list of data engineering project examples below. This big data project discusses IoT architecture with a sample use case.
Compute Plane: Processes data against serverless or classic compute resources. Workspace Storage Account: System data, notebooks, and logs are stored here. This storage account contains files and metadata associated with user workspaces, notebooks, and logs. It provides data versioning and transaction support.
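As an illustration of that versioning, a minimal sketch of Delta Lake time travel from PySpark, assuming a Spark session with the Delta extensions configured; the table path is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-versions").getOrCreate()

path = "/mnt/lake/orders"  # hypothetical Delta table location

current = spark.read.format("delta").load(path)
# Read the table as of an earlier version for audits or reproducibility
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

print(current.count(), v0.count())
```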
Once a business need is defined and a minimal viable product (MVP) is scoped, the data management phase begins with: Data ingestion: data is acquired, cleansed, and curated before it is transformed. Feature engineering: data is transformed to support ML model training. ML workflow, ubr.to/3EJHjvm
First, CDC theoretically allows companies to analyze and react to data in real time, as it’s generated. It works with existing streaming systems like Apache Kafka, Amazon Kinesis, and Azure Event Hubs, making it easier than ever to build a real-time data pipeline.
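A minimal sketch of reacting to change events from a Kafka topic, assuming Debezium-style JSON change records; the topic and field names are hypothetical.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "dbserver.inventory.customers",  # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    change = record.value
    op = change.get("op")        # c=create, u=update, d=delete in Debezium
    after = change.get("after")  # row state after the change
    if op in ("c", "u"):
        print(f"upsert: {after}")
    elif op == "d":
        print(f"delete: {change.get('before')}")
```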
Tools and platforms for unstructured data management: unstructured data collection presents unique challenges due to the information’s sheer volume, variety, and complexity. The process requires extracting data from diverse sources, typically via APIs. Invest in data governance.
It serves as a distributed processing engine for both categories of data streams: unbounded and bounded. Support for stream and batch processing, comprehensive state management, event-time processing semantics, and consistency guarantee for the state are just a few of Flink's capabilities.
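A minimal sketch of that unified model in PyFlink: a bounded, in-memory collection flows through the same dataflow API an unbounded source would.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A bounded source (a finite collection) runs through the same dataflow
# API that an unbounded Kafka source would
env.from_collection([("clicks", 3), ("views", 7), ("clicks", 2)]) \
   .map(lambda kv: (kv[0], kv[1] * 2)) \
   .print()

env.execute("bounded-stream-demo")
```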
Analysis of logs, metrics, and security events. With Elasticsearch, you can aggregate and analyze large streams of logs, metrics, and security events in near real-time, making it indispensable for system monitoring and security information and event management (SIEM). Real-time behavior modeling with ML.
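A minimal sketch of a log aggregation with the Elasticsearch Python client (8.x-style keyword arguments); the index name and fields are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="app-logs",
    size=0,  # we only want the aggregation, not individual hits
    query={"range": {"@timestamp": {"gte": "now-15m"}}},
    aggs={"by_level": {"terms": {"field": "level.keyword"}}},
)

# Count of recent log events per severity level
for bucket in resp["aggregations"]["by_level"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```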
Moreover, over 20 percent of surveyed companies were found to be utilizing 1,000 or more data sources to provide data to analytics systems. These sources commonly include databases, SaaS products, and event streams. Databases store key information that powers a company’s product, such as user data and product data.
Read our article on Hotel Data Management to have a full picture of what information can be collected to boost revenue and customer satisfaction in hospitality. While all three are about data acquisition, they have distinct differences. They dictate how data is gathered, stored, and processed and who has access to what content.
Why is data pipeline architecture important? Databricks – Databricks, the Apache Spark-as-a-service platform, has pioneered the data lakehouse, giving users the option to leverage both structured and unstructured data along with the low-cost storage features of a data lake.
We’ll cover: What is a data platform?
StructType is a collection of StructField objects that determine column name, column data type, field nullability, and metadata. To define the columns, PySpark offers the StructField class from pyspark.sql.types, which takes the column name (String), column type (DataType), a nullable flag (Boolean), and metadata.
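A minimal sketch of such a schema definition; the column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Each StructField carries name, type, and nullability
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
])

df = spark.createDataFrame([("ada", 36), ("alan", 41)], schema=schema)
df.printSchema()  # shows each column's name, type, and nullability
```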