Summary: Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. To level up its value, a new trend of active metadata is being implemented, enabling use cases like keeping BI reports up to date, auto-scaling your warehouses, and automating data governance.
Below is the entire set of steps in the data lifecycle, and each step will be supported by a dedicated blog post (see Fig. 1): Data Collection – data ingestion and monitoring at the edge (whether the edge is industrial sensors or people in a vehicle showroom).
Data Pipeline Observability: A Model For Data Engineers, by Eitan Chazbani, June 29, 2023. Data pipeline observability is your ability to monitor and understand the state of a data pipeline at any time. We believe the world’s data pipelines need better data observability.
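To make that concrete, here is a minimal sketch of the kind of freshness and volume check an engineer might run against a pipeline's output table; the table name, thresholds, and the `get_table_stats` helper are hypothetical stand-ins, not part of any specific observability product.

```python
import datetime as dt

# Hypothetical helper: in practice this would query your warehouse's
# information schema or an observability API for table statistics.
def get_table_stats(table: str) -> dict:
    return {
        "last_loaded_at": dt.datetime(2023, 6, 29, 7, 15),
        "row_count": 1_250_000,
    }

def check_pipeline_health(table: str,
                          max_staleness_hours: float = 6.0,
                          min_rows: int = 1_000_000) -> list[str]:
    """Return a list of observability alerts for one pipeline output table."""
    stats = get_table_stats(table)
    alerts = []

    staleness = dt.datetime.utcnow() - stats["last_loaded_at"]
    if staleness > dt.timedelta(hours=max_staleness_hours):
        alerts.append(f"{table}: data is {staleness} old (freshness breach)")

    if stats["row_count"] < min_rows:
        alerts.append(f"{table}: only {stats['row_count']} rows (volume breach)")

    return alerts

if __name__ == "__main__":
    for alert in check_pipeline_health("analytics.orders_daily"):
        print(alert)
```

Checks like these cover only the "state of the pipeline" half of observability; incident tracing and lineage add the rest.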
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. In fact, while only 3.5%
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often take hours to days or even weeks. What is the workflow for someone getting Sifflet integrated into their data stack?
The author emphasizes the importance of mastering state management, understanding "local first" data processing (prioritizing single-node solutions before distributed systems), and leveraging an asset graph approach for data pipelines. [link] Grab: Improving Hugo's stability and addressing on-call challenges through automation.
We hope the real-time demonstrations of Ascend automating data pipelines were a real treat, along with the special edition T-shirt designed specifically for the show (picture of our founder and CEO rocking the t-shirt below). Instead, it is a Sankey diagram driven by the same dynamic metadata that runs the Ascend control plane.
Scalable Annotation Service — Marken, by Varun Sekhri and Meenakshi Jindal. Introduction: At Netflix, we have hundreds of microservices, each with its own data models or entities. For example, we have a service that stores a movie entity’s metadata or a service that stores metadata about images. In this case it is BOUNDING_BOX.
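As a rough illustration of the kind of annotation entity such a service might store, here is a minimal sketch of a bounding-box annotation model; the class and field names are hypothetical and are not taken from Netflix's Marken implementation.

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class BoundingBox:
    # Normalized coordinates of the box within an image or video frame.
    x: float
    y: float
    width: float
    height: float

@dataclass
class Annotation:
    # Hypothetical annotation entity: ties a label and a bounding box
    # to a piece of media owned by another microservice.
    media_id: str
    annotation_type: str          # e.g. "BOUNDING_BOX"
    label: str                    # e.g. "car", "face"
    box: BoundingBox
    annotation_id: str = field(default_factory=lambda: str(uuid.uuid4()))

example = Annotation(
    media_id="movie_12345_frame_000120",
    annotation_type="BOUNDING_BOX",
    label="car",
    box=BoundingBox(x=0.12, y=0.40, width=0.25, height=0.18),
)
print(example)
```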
We all know that data freshness plays a critical role in Lakehouse performance. If we can place the metadata, indexing, and recent data files in Express One, we can potentially build a Snowflake-style, performant architecture in the Lakehouse. Apache Hudi, for example, introduces an indexing technique to the Lakehouse.
I won’t bore you with the importance of data quality in this blog. Instead, let’s examine the current data pipeline architecture and ask why data quality is expensive. Rather than looking at the implementation of data quality frameworks, let's examine the architectural patterns of the data pipeline.
The costs of developing and running data pipelines are coming under increasing scrutiny because the bills for infrastructure and data engineering talent are piling up. For data teams, it is time to ask: “How can we have an impact on these runaway costs and still deliver unprecedented business value?”
Atlan is the metadata hub for your data ecosystem. Instead of locking all of that information into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Go to dataengineeringpodcast.com/atlan today to learn more about how you can take advantage of active metadata and escape the chaos.
You are about to make structural changes to the data and want to know who and what downstream of your service will be impacted. Finally, imagine yourself in the role of a data platform reliability engineer tasked with providing advance lead time to data pipeline (ETL) owners by proactively identifying issues upstream of their ETL jobs.
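For a sense of what such impact analysis looks like in practice, here is a minimal sketch that walks a lineage graph to find every downstream asset of a table you are about to change; the graph contents and asset names are hypothetical, and a real platform would read them from its metadata or lineage store.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets that consume it.
LINEAGE = {
    "raw.orders":           ["staging.orders_clean"],
    "staging.orders_clean": ["marts.daily_revenue", "marts.customer_ltv"],
    "marts.daily_revenue":  ["bi.revenue_dashboard"],
    "marts.customer_ltv":   [],
    "bi.revenue_dashboard": [],
}

def downstream_impact(asset: str) -> set[str]:
    """Breadth-first traversal over the lineage graph to collect impacted assets."""
    impacted, queue = set(), deque([asset])
    while queue:
        current = queue.popleft()
        for consumer in LINEAGE.get(current, []):
            if consumer not in impacted:
                impacted.add(consumer)
                queue.append(consumer)
    return impacted

print(downstream_impact("raw.orders"))
# {'staging.orders_clean', 'marts.daily_revenue', 'marts.customer_ltv', 'bi.revenue_dashboard'}
```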
Experience Enterprise-Grade Apache Airflow: Astro augments Airflow with enterprise-grade features to enhance productivity, meet scalability and availability demands across your data pipelines, and more. Hudi seems to be a de facto choice for CDC data lake features. Notion migrated the insert-heavy workload from Snowflake to Hudi.
I’d like to discuss some popular data engineering questions: Modern data engineering (DE). Does your DE work well enough to fuel advanced data pipelines and business intelligence (BI)? Are your data pipelines efficient? Parallel data processing: what is it? ML model training using Airflow.
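On that last point, below is a minimal sketch of ML model training orchestrated by Airflow; the DAG id, schedule, and the `extract_features` / `train_model` callables are hypothetical stand-ins rather than anything from the article, and the `schedule` argument assumes Airflow 2.4 or later.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    # Placeholder: pull training data from the warehouse and write features.
    print("extracting features")

def train_model():
    # Placeholder: fit a model on the extracted features and persist it.
    print("training model")

with DAG(
    dag_id="ml_model_training",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",   # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    features = PythonOperator(task_id="extract_features", python_callable=extract_features)
    training = PythonOperator(task_id="train_model", python_callable=train_model)
    features >> training
```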
In the past year, the Bank of the West has begun using the Cloudera platform to establish a data governance and security framework to manage and protect its customers’ sensitive information. The platform is centralizing the data, data management and governance, and building custom controls for data ingestion into the system.
These engineering functions are almost exclusively concerned with data pipelines, spanning ingestion, transformation, orchestration, and observation — all the way to data product delivery to the business tools and downstream applications. Pipelines need to grow faster than the cost to run them.
In this post, we will help you quickly level up your overall knowledge of data pipeline architecture by reviewing: What is data pipeline architecture? Why is data pipeline architecture important?
Here is a list of global resources that can help you navigate the field: The Data Engineer Roadmap — an image with advice and technology names to watch. The Reddit r/dataengineering wiki, a place where some data engineering definitions are written down. Workflows (Airflow, Prefect, Dagster, etc.). Is it really modern?
Customers who have chosen Google Cloud as their cloud platform can now use CDP Public Cloud to create secure, governed data lakes in their own cloud accounts and deliver security, compliance and metadata management across multiple compute clusters. Data Preparation (Apache Spark and Apache Hive).
Cloudera delivers an enterprise data cloud that enables companies to build end-to-end data pipelines for hybrid cloud, spanning edge devices to public or private cloud, with integrated security and governance underpinning it to protect customers’ data. Data science and machine learning workloads using CDSW.
CSP was recently recognized as a leader in the 2022 GigaOm Radar for Streaming Data Platforms report. Faster data ingestion: streaming ingestion pipelines. The DevOps/app dev team wants to know how data flows between such entities and understand the key performance metrics (KPMs) of these entities.
Since we announced the general availability of Apache Iceberg in Cloudera Data Platform (CDP), Cloudera customers, such as Teranet, have built open lakehouses to future-proof their data platforms for all their analytical workloads. Only metadata will be regenerated. Data quality using table rollback.
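To illustrate the table rollback idea, here is a minimal PySpark sketch using Iceberg's built-in Spark procedure to roll a table back to an earlier snapshot; the catalog name, table, and snapshot id are hypothetical, and the Iceberg catalog configuration for the Spark session is assumed to be in place already.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog
# named "my_catalog" (catalog configuration omitted here).
spark = SparkSession.builder.appName("iceberg-rollback").getOrCreate()

# Inspect the table's snapshot history to pick a known-good snapshot id.
spark.sql(
    "SELECT snapshot_id, committed_at FROM my_catalog.db.orders.snapshots"
).show()

# Roll the table back to that snapshot; only metadata is rewritten,
# the underlying data files stay in place.
spark.sql(
    "CALL my_catalog.system.rollback_to_snapshot('db.orders', 1234567890123456789)"
)
```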
We built AutoOptimize to efficiently and transparently optimize the data and metadata storage layout while maximizing their cost and performance benefits. Sometimes data engineers write downstream ETLs on ingested data to optimize the data/metadata layouts to make other ETL processes cheaper and faster.
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. What are some of the data management considerations that are introduced by vector databases?
That’s why, in addition to integrating with your central data warehouse, lake, and lakehouse, Monte Carlo also integrates with transformation, orchestration, and now data ingestion tools. Now teams can instantly get full visibility into how these systems may be impacting their data assets, all in a single pane of glass.
Data Flow – an individual data pipeline. Data Flows include the ingestion of raw data, transformation via SQL and Python, and sharing of finished data products. Data Plane – the data cloud where the data pipeline workload runs, like Databricks, BigQuery, or Snowflake.
Leveraging TensorFlow Transform for scaling data pipelines for production environments. Data pre-processing is one of the major steps in any machine learning pipeline. ML pipeline operations begin with data ingestion and validation, followed by transformation.
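For readers unfamiliar with TensorFlow Transform, here is a minimal sketch of a `preprocessing_fn`, the function tf.Transform uses to define full-pass transformations such as scaling and vocabulary generation; the feature names are hypothetical and the Beam pipeline wiring is omitted.

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """tf.Transform preprocessing: analyzed over the full dataset, then
    applied identically at training and serving time."""
    outputs = {}
    # Scale a numeric feature to zero mean and unit variance.
    outputs["amount_scaled"] = tft.scale_to_z_score(inputs["amount"])
    # Map a string feature to integer ids using a learned vocabulary.
    outputs["category_id"] = tft.compute_and_apply_vocabulary(inputs["category"])
    # Pass the label through unchanged.
    outputs["label"] = inputs["label"]
    return outputs
```

Because the analysis step runs over the whole dataset, the same transform graph can be attached to the serving model, which is what makes this approach suitable for production pipelines.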
Query across your ANN indexes on vector embeddings, and your JSON and geospatial “metadata” fields, efficiently. Spin up a Virtual Instance for streaming data ingestion. If you know SQL, you already know how to use Rockset. We obsess about efficiency in the cloud. Spin up another completely isolated Virtual Instance for your app.
DataOps is a collaborative approach to data management that combines the agility of DevOps with the power of data analytics. It aims to streamline data ingestion, processing, and analytics by automating and integrating various data workflows.
The Solution: ‘Payload’ Data Journeys. Traditional Data Observability usually focuses on a ‘process journey,’ tracking the performance and status of data pipelines. It assigns unique identifiers to each data item—referred to as ‘payloads’—related to each event.
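A rough sketch of the idea, assuming nothing about any particular vendor's implementation: attach a unique payload id to each data item at ingestion and log it at every pipeline stage, so the item's journey can be reconstructed later. The `journey_log` store and the stage names here are hypothetical.

```python
import uuid

# Hypothetical in-memory journey log; a real system would persist these
# events to a database or event stream for later querying.
journey_log: list[dict] = []

def tag_payload(record: dict) -> dict:
    """Assign a unique payload identifier to an incoming data item."""
    record["_payload_id"] = str(uuid.uuid4())
    return record

def log_stage(record: dict, stage: str, status: str = "ok") -> None:
    """Record that this payload passed through a pipeline stage."""
    journey_log.append({"payload_id": record["_payload_id"],
                        "stage": stage,
                        "status": status})

# Usage: tag at ingestion, then log each hop of the payload's journey.
order = tag_payload({"order_id": 42, "amount": 99.5})
log_stage(order, "ingestion")
log_stage(order, "transformation")
log_stage(order, "warehouse_load")

print([e["stage"] for e in journey_log if e["payload_id"] == order["_payload_id"]])
```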
DataOps, short for data operations, is an emerging discipline that focuses on improving the collaboration, integration, and automation of data processes across an organization. These tools help organizations implement DataOps practices by providing a unified platform for data teams to collaborate, share, and manage their data assets.
A true enterprise-grade integration solution calls for source and target connectors that can accommodate: VSAM files, COBOL copybooks, open standards like JSON, and modern platforms like Amazon Web Services (AWS), Confluent, Databricks, or Snowflake. Questions to ask each vendor: Which enterprise data sources and targets do you support?
The architecture is three-layered. Database Storage: Snowflake has a mechanism to reorganize the data into its internal optimized, compressed, columnar format and stores this optimized data in cloud storage. This stage handles all aspects of data storage, such as organization, file size, structure, compression, metadata, and statistics.
This blog post showcases a real-time data pipeline built in Snowflake that leverages Slowly Changing Dimensions (SCD Type 2) and Finalizer Tasks to ensure your customer data is always fresh, accurate, and reflects historical changes. Snowflake’s Finalizer Tasks come to the rescue!
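As a rough sketch of the SCD Type 2 pattern the post refers to, the following snippet runs a two-step expire-and-insert against Snowflake via the standard Python connector; the table and column names (`DIM_CUSTOMER`, `STG_CUSTOMER`, `IS_CURRENT`, etc.) and the connection parameters are hypothetical, and the post's own finalizer-task setup is not reproduced here.

```python
import snowflake.connector

# Step 1: close out current rows whose tracked attributes have changed.
EXPIRE_CHANGED_ROWS = """
UPDATE DIM_CUSTOMER d
SET IS_CURRENT = FALSE, VALID_TO = CURRENT_TIMESTAMP()
FROM STG_CUSTOMER s
WHERE d.CUSTOMER_ID = s.CUSTOMER_ID
  AND d.IS_CURRENT = TRUE
  AND d.EMAIL <> s.EMAIL
"""

# Step 2: insert a fresh current-version row for new or changed customers.
INSERT_NEW_VERSIONS = """
INSERT INTO DIM_CUSTOMER (CUSTOMER_ID, EMAIL, VALID_FROM, VALID_TO, IS_CURRENT)
SELECT s.CUSTOMER_ID, s.EMAIL, CURRENT_TIMESTAMP(), NULL, TRUE
FROM STG_CUSTOMER s
LEFT JOIN DIM_CUSTOMER d
  ON d.CUSTOMER_ID = s.CUSTOMER_ID AND d.IS_CURRENT = TRUE
WHERE d.CUSTOMER_ID IS NULL
"""

conn = snowflake.connector.connect(account="my_account", user="my_user",
                                   password="***", warehouse="LOAD_WH",
                                   database="ANALYTICS", schema="DIM")
try:
    cur = conn.cursor()
    cur.execute(EXPIRE_CHANGED_ROWS)
    cur.execute(INSERT_NEW_VERSIONS)
finally:
    conn.close()
```

The two-statement form avoids the limitation that a single MERGE cannot both expire an existing row and insert its replacement version for the same key.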
Data Engineering Weekly is brought to you by RudderStack. RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
The Essential Six Capabilities: To set the stage for impactful and trustworthy data products in your organization, you need to invest in six foundational capabilities: data pipelines, data integrity, data lineage, data stewardship, data catalog, and data product costing. Let’s review each one in detail.
Databricks announced that Delta table metadata will also be compatible with the Iceberg format, and Snowflake has also been moving aggressively to integrate with Iceberg. I think it’s safe to say it’s getting pretty cold in here. Figure: How Apache Iceberg tables structure metadata (image courtesy of Dremio). So, is Iceberg right for you?
From data ingestion and data science to our ad bidding [2], GCP is an accelerant in our development cycle, sometimes reducing time-to-market from months to weeks. Data Ingestion and Analytics at Scale: Ingestion of performance data, whether generated by a search provider or internally, is a key input for our algorithms.
Efficient Scheduling and Runtime. Increased Adaptability and Scope. Faster Analysis and Real-Time Prediction. Introduction to the Machine Learning Pipeline Architecture. How to Build an End-to-End Machine Learning Pipeline? Is Python suitable for machine learning pipeline design patterns?
In the case of data products, these are networks of data pipelines, which makes them an integral part of your modern operational machinery. To determine the ROI of any particular data product, you need to attribute the costs of building and running data pipelines.
Data Catalog: an organized inventory of data assets relying on metadata to help with data management. Data Engineering: a process by which data engineers make data useful. Data Integration: combining data from various, disparate sources into one unified view.