Data, Data Ingestion and Metadata - Data Engineering Digest

Level Up Your Data Platform With Active Metadata

Data Engineering Podcast

JUNE 19, 2022

Summary Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. In order to level up their value a new trend of active metadata is being implemented, allowing use cases like keeping BI reports up to date, auto-scaling your warehouses, and automated data governance.

Metadata

Metadata MongoDB Scala MySQL

Data Quality Testing: A Shared Resource for Modern Data Teams

DataKitchen

JUNE 6, 2025

Data Quality Testing: A Shared Resource for Modern Data Teams In today’s AI-driven landscape, where data is king, every role in the modern data and analytics ecosystem shares one fundamental responsibility: ensuring that incorrect data never reaches business customers. Each role touches data differently.

Data Ingestion

Data Ingestion Government Data Governance Data

Snowflake Unistore: Hybrid Tables Now Generally Available

Snowflake

NOVEMBER 12, 2024

These organizations and many more are using Hybrid Tables to simplify their data architectures and governance and security by consolidating transactional and analytical workloads onto Snowflake's single unified data platform. Roofstock, a leading investment platform for single-family rentals, is seeing these benefits firsthand. "We

Food

Food Metadata Education Data Architect

Webinars

What’s New in Apache Airflow® 3.0—And How Will It Reshape Your Data Workflows?

MORE WEBINARS

Is Apache Iceberg the New Hadoop? Navigating the Complexities of Modern Data Lakehouses

Data Engineering Weekly

MARCH 5, 2025

The modern data stack constantly evolves, with new technologies promising to solve age-old problems like scalability, cost, and data silos. It promised to address key pain points: Scaling: Handling ever-increasing data volumes. Speed: Accelerating data insights. Data Silos: Breaking down barriers between data sources.

Hadoop

Hadoop Metadata Data Ingestion Data Governance

Data Engineering Weekly #222

Data Engineering Weekly

JUNE 1, 2025

Join Dagster and Neurospace to learn: - How to build AI pipelines with orchestration baked in - How to track data lineage for audits and traceability - Tips for designing compliant workflows under the EU AI Act Register for the technical session DuckDB: DuckLake - SQL as a Lakehouse Format DuckDB announced a new open table format, DuckLake.

Data Engineering

Data Engineering Data Engineer Engineering Relational Database

50+ Azure Data Factory Interview Questions and Answers [2025]

ProjectPro

JUNE 6, 2025

Discover 50+ Azure Data Factory interview questions and answers for all experience levels. A report by ResearchAndMarkets projects the global data integration market size to grow from USD 12.24 A report by ResearchAndMarkets projects the global data integration market size to grow from USD 12.24 billion in 2020 to USD 24.84

Data Lake

Data Lake Metadata SQL Datasets

How to Build a Data Lake?

ProjectPro

JUNE 6, 2025

This guide is your roadmap to building a data lake from scratch. We'll break down the fundamentals, walk you through the architecture, and share actionable steps to set up a robust and scalable data lake. Traditional data storage systems like data warehouses were designed to handle structured and preprocessed data.

Data Lake

Data Lake Building Hadoop Raw Data

Simplifying Data Architecture and Security to Accelerate Value

Snowflake

NOVEMBER 11, 2024

It’s easy these days for an organization’s data infrastructure to begin looking like a maze, with an accumulation of point solutions here and there. Snowflake is committed to doing just that by continually adding features to help our customers simplify how they architect their data infrastructure. Here’s a closer look.

Data Architecture

Data Architecture Architecture Data Lake Kafka

Collecting And Retaining Contextual Metadata For Powerful And Effective Data Discovery

Data Engineering Podcast

AUGUST 13, 2022

Summary Data is useless if it isn’t being used, and you can’t use it if you don’t know where it is. Data catalogs were the first solution to this problem, but they are only helpful if you know what you are looking for. Data stacks are becoming more and more complex. Sifflet also offers a 2-week free trial.

Metadata

Metadata MongoDB Scala MySQL

AWS Glue-Unleashing the Power of Serverless ETL Effortlessly

ProjectPro

JUNE 6, 2025

Do ETL and data integration activities seem complex to you? Read this blog to understand everything about AWS Glue that makes it one of the most popular data integration solutions in the industry. Did you know the global big data market will likely reach $268.4 Businesses are leveraging big data now more than ever.

AWS

AWS Scala Metadata Data Lake

How to Build an End to End Machine Learning Pipeline?

ProjectPro

JUNE 6, 2025

A machine learning pipeline helps automate machine learning workflows by processing and integrating data sets into a model, which can then be evaluated and delivered. The machine learning pipeline offers data scientists a way to handle data for training, orchestrate models, and monitor them in deployment.

Machine Learning

Machine Learning Building Amazon Web Services Deep Learning

Why Open Table Format Architecture is Essential for Modern Data Systems

phData: Data Engineering

NOVEMBER 8, 2024

The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms.

Architecture

Architecture Systems Data Lake Google Cloud

How To Build A Batch Data Pipeline?

ProjectPro

JUNE 6, 2025

Building a batch pipeline is essential for processing large volumes of data efficiently and reliably. Are you ready to step into the heart of big data projects and take control of data like a pro? Are you ready to step into the heart of big data projects and take control of data like a pro?

Data Pipeline

Data Pipeline Building Retail Data Ingestion

Data Engineering Weekly #213

Data Engineering Weekly

MARCH 23, 2025

Editor’s Note: Data Council 2025, Apr 22-24, Oakland, CA Data Council has always been one of my favorite events to connect with and learn from the data engineering community. Data Council 2025 is set for April 22-24 in Oakland, CA. These are common LinkedIn requests. The article resonated with me when I read it.

Data Engineering

Data Engineering Data Engineer Engineering Data Ingestion

10+ Top Data Pipeline Tools to Streamline Your Data Journey

ProjectPro

JUNE 6, 2025

Today, data engineers are constantly dealing with a flood of information and the challenge of turning it into something useful. The journey from raw data to meaningful insights is no walk in the park. It requires a skillful blend of data engineering expertise and the strategic use of tools designed to streamline this process.

Data Pipeline

Data Pipeline Google Cloud Kafka AWS

Manufacturing Data Ingestion into Snowflake

Snowflake

JANUARY 26, 2023

Accessing data from the manufacturing shop floor is one of the key topics of interest with the majority of cloud platform vendors due to the pace of Industry 4.0 practices is the ability to collect and analyze vast amounts of data, allowing for improved efficiency, accuracy, and decision-making. Industry 4.0, cannot be overstated.

Data Ingestion

Data Ingestion Manufacturing Unstructured Data Data

How To Use Airbyte, dbt-teradata, Dagster, and Teradata Vantage™ for Seamless Data Integration

Teradata

MAY 30, 2025

Register now Home Insights Data platform Article How To Use Airbyte, dbt-teradata, Dagster, and Teradata Vantage™ for Seamless Data Integration Build and orchestrate a data pipeline in Teradata Vantage using Airbyte, Dagster, and dbt. In this case, we select Sample Data (Faker). dbt-core dagster==1.7.9

Data Integration

Data Integration Raw Data Metadata Data Pipeline

20 Best Open Source Big Data Projects to Contribute on GitHub

ProjectPro

JUNE 6, 2025

The adaptability and technical superiority of such open-source big data projects make them stand out for community use. As per the surveyors, Big data (35 percent), Cloud computing (39 percent), operating systems (33 percent), and the Internet of Things (31 percent) are all expected to be impacted by open source shortly.

Big Data

Big Data Project Metadata Programming Language

How to Learn AWS for Data Engineering?

ProjectPro

JUNE 6, 2025

Becoming a successful aws data engineer demands you to learn AWS for data engineering and leverage its various services for building efficient business applications. million organizations that want to be data-driven choose AWS as their cloud services partner. Table of Contents Why Learn AWS for Data Engineering?

AWS

AWS Data Engineering Data Engineer Engineering

Tame The Entropy In Your Data Stack And Prevent Failures With Sifflet

Data Engineering Podcast

NOVEMBER 20, 2022

Sifflet is a platform that brings your entire data stack into focus to improve the reliability of your data assets and empower collaboration across your teams. In this episode CEO and founder Salma Bakouk shares her views on the causes and impacts of "data entropy" and how you can tame it before it leads to failures.

Data Lake

Data Lake MongoDB Data Ingestion Scala

50 PySpark Interview Questions and Answers For 2025

ProjectPro

JUNE 6, 2025

With the global data volume projected to surge from 120 zettabytes in 2023 to 181 zettabytes by 2025, PySpark's popularity is soaring as it is an essential tool for efficient large scale data processing and analyzing vast datasets. Resilient Distributed Datasets (RDDs) are the fundamental data structure in Apache Spark.

Hadoop

Hadoop Metadata Java Datasets

Data Engineering Zoomcamp – Data Ingestion (Week 2)

Hepta Analytics

FEBRUARY 14, 2022

DE Zoomcamp 2.2.1 – Introduction to Workflow Orchestration Following last weeks blog , we move to data ingestion. We already had a script that downloaded a csv file, processed the data and pushed the data to postgres database. This week, we got to think about our data ingestion design.

Data Ingestion

Data Ingestion Data Engineering Data Engineer Engineering

Apache Ozone Powers Data Science in CDP Private Cloud

Cloudera

AUGUST 26, 2021

Ozone natively provides Amazon S3 and Hadoop Filesystem compatible endpoints in addition to its own native object store API endpoint and is designed to work seamlessly with enterprise scale data warehousing, machine learning and streaming workloads. Learn more about the impacts of global data sharing in this blog, The Ethics of Data Exchange.

Data Science

Data Science Cloud Hadoop Metadata

Data Engineering Weekly #217

Data Engineering Weekly

APRIL 20, 2025

[link] Jing Ge: Context Matters — The Vision of Data Analytics and Data Science Leveraging MCP and A2A All aspects of software engineering are rapidly being automated with various coding AI tools, as seen in the AI technology radar. Data engineering is one aspect where I see a few startups starting to disrupt.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

7 Best Data Warehousing Tools for Efficient Data Storage Needs

ProjectPro

JUNE 6, 2025

Data is often referred to as the new oil, and just like oil requires refining to become useful fuel, data also needs a similar transformation to unlock its true value. This transformation is where data warehousing tools come into play, acting as the refining process for your data. Why Choose a Data Warehousing Tool?

Data Storage

Data Storage PostgreSQL Data Warehouse AWS

Microsoft Fabric Architecture Explained: Core Components & Benefit

Edureka

MAY 27, 2025

Microsoft Fabric is a next-generation data platform that combines business intelligence, data warehousing, real-time analytics, and data engineering into a single integrated SaaS framework. The architecture of Microsoft Fabric is based on several essential elements that work together to simplify data processes: 1.

Architecture

Architecture BI Business Intelligence Data Lake

How to Build RAG Pipelines for LLM Projects?

ProjectPro

JUNE 6, 2025

It discusses the RAG architecture, outlining key stages like data ingestion , data retrieval, chunking , embedding generation , and querying. With step-by-step examples, you'll learn to integrate data from text files and PDFs while leveraging embeddings for precision.

Building

Building Project Metadata Data Ingestion

Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and…

Netflix Tech

MARCH 25, 2019

Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and Efficiency By: Di Lin , Girish Lingappa , Jitender Aswani Imagine yourself in the role of a data-inspired decision maker staring at a metric on a dashboard about to make a critical business decision but pausing to ask a question?—?“Can

Building

Building Metadata Transportation Data Ingestion

What is Apache Iceberg: Features, Architecture & Use Cases

ProjectPro

JUNE 6, 2025

Explore what is Apache Iceberg, what makes it different, and why it’s quickly becoming the new standard for data lake analytics. Data lakes were born from a vision to democratize data, enabling more people, tools, and applications to access a wider range of data. Metadata Layer 3. Workarounds became the norm.

Architecture

Architecture Data Lake Metadata Cloud Storage

The Rise of the Data Engineer

Maxime Beauchemin

JANUARY 20, 2017

By the time I left in 2013, I was a data engineer. We were data engineers! Data Engineering? Data science as a discipline was going through its adolescence of self-affirming and defining itself. At the same time, data engineering was the slightly younger sibling, but it was going through something similar.

Data Engineering

Data Engineering Data Engineer Engineering Database-centric

What is Retrieval Augmented Generation (RAG) Architecture?

ProjectPro

JUNE 6, 2025

By incorporating external data, it ensures better accuracy and relevance. A RAG and LLM architecture enhances the performance of LLMs like GPT or Llama2 , by integrating external data retrieval to produce more accurate and context-aware responses. These vectors are stored in a vector database for quick retrieval.

Architecture

Architecture Data Ingestion Google Cloud AWS

5 AWS Glue Use Cases and Examples That Showcase Its Power

ProjectPro

JUNE 6, 2025

Did you know over 5140 businesses worldwide started using AWS Glue as a big data tool in 2023? With the rapid growth of data in the industry, businesses often deal with several challenges when handling complex processes such as data integration and analytics.

AWS

AWS IT Data Lake BI

Improved Ascend for Databricks, New Lineage Visualization, and Better Incremental Data Ingestion

Ascend.io

DECEMBER 19, 2022

We hope the real-time demonstrations of Ascend automating data pipelines were a real treat—a long with the special edition T-Shirt designed specifically for the show (picture of our founder and CEO rocking the t-shirt below). With this approach, we’re able to augment our uniquely beautiful and intuitive visualization of data pipelines.

Data Ingestion

Data Ingestion Metadata Data Pipeline AWS

Simplify Data Security For Sensitive Information With The Skyflow Data Privacy Vault

Data Engineering Podcast

JUNE 5, 2022

Summary The best way to make sure that you don’t leak sensitive data is to never have it in the first place. The team at Skyflow decided that the second best way is to build a storage system dedicated to securely managing your sensitive information and making it easy to integrate with your applications and data systems.

Data Security

Data Security Metadata MongoDB Scala

The Ultimate 101 Guide to Apache Airflow DAGS

ProjectPro

JUNE 6, 2025

Looking for an efficient tool for streamlining and automating your data processing workflows? Let's consider an example of a data processing pipeline that involves ingesting data from various sources, cleaning it, and then performing analysis. Apache Airflow DAGs are your one-stop solution!

Data Pipeline

Data Pipeline PostgreSQL Python Database

Top 10 Data Engineering Tools You Must Learn in 2025

ProjectPro

JUNE 6, 2025

This blog post provides an overview of the top 10 data engineering tools for building a robust data architecture to support smooth business operations. Table of Contents What are Data Engineering Tools? Dice Tech Jobs report 2020 indicates Data Engineering is one of the highest in-demand jobs worldwide.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

The Ultimate Guide to Getting Started with AWS Athena in 2025

ProjectPro

JUNE 6, 2025

Cloud computing is the future, given that the data being produced and processed is increasing exponentially. As per the March 2022 report by statista.com, the volume for global data creation is likely to grow to more than 180 zettabytes over the next five years, whereas it was 64.2 Is AWS Athena a Good Choice for your Big Data Project?

AWS

AWS Big Data SQL Raw Data

Snowflake’s Best-in-Class Enterprise Data Foundation Unlocks Interoperability with Open Data and Internal Collaboration

Snowflake

JUNE 4, 2024

Snowflake provides a strong data foundation anchored on unified data, optimal TCO and universal governance. The Snowflake platform eliminates silos to enable any architectural pattern, while supporting all data types and workloads. These capabilities can even be extended to Iceberg tables created by other engines.

Government

Government Data Ingestion PostgreSQL Data

Data Engineering Weekly #179

Data Engineering Weekly

JULY 7, 2024

Experience Enterprise-Grade Apache Airflow Astro augments Airflow with enterprise-grade features to enhance productivity, meet scalability and availability demands across your data pipelines, and more. Hudi seems to be a de facto choice for CDC data lake features. Notion migrated the insert heavy workload from Snowflake to Hudi.

Data Engineering

Data Engineering Data Engineer Engineering Data Lake

100+ Big Data Interview Questions and Answers 2025

ProjectPro

JUNE 6, 2025

If you're looking to break into the exciting field of big data or advance your big data career, being well-prepared for big data interview questions is essential. Get ready to expand your knowledge and take your big data career to the next level! “Data analytics is the future, and the future is NOW!

Big Data

Big Data Hadoop Relational Database AWS

How to learn data engineering

Christophe Blefari

JANUARY 20, 2024

Learn data engineering, all the references ( credits ) This is a special edition of the Data News. But right now I'm in holidays finishing a hiking week in Corsica 🥾 So I wrote this special edition about: how to learn data engineering in 2024. The idea is to create a living reference about Data Engineering.

Data Engineering

Data Engineering Data Engineer Engineering Hadoop

Scalable Annotation Service?—?Marken

Netflix Tech

JANUARY 25, 2023

Scalable Annotation Service — Marken by Varun Sekhri , Meenakshi Jindal Introduction At Netflix, we have hundreds of micro services each with its own data models or entities. For example, we have a service that stores a movie entity’s metadata or a service that stores metadata about images. Allows to annotate any entity.

Algorithm

Algorithm Metadata Media Data Ingestion

How to Transition from ETL Developer to Data Engineer?

ProjectPro

JUNE 6, 2025

In the thought process of making a career transition from ETL developer to data engineer job roles? Read this blog to know how various data-specific roles, such as data engineer, data scientist, etc., differ from ETL developer and the additional skills you need to transition from ETL developer to data engineer job roles.

Data Engineering

Data Engineering Data Engineer Engineering ETL Tools

The Modern Data Lakehouse: An Architectural Innovation

Cloudera

SEPTEMBER 9, 2022

The promise of a modern data lakehouse architecture. Imagine having self-service access to all business data, anywhere it may be, and being able to explore it all at once. Imagine quickly answering burning business questions nearly instantly, without waiting for data to be found, shared, and ingested.

Architecture

Architecture Metadata Machine Learning Unstructured Data

Level Up Your Data Platform With Active Metadata

Data Quality Testing: A Shared Resource for Modern Data Teams

Webinars

Trending Sources

Snowflake Unistore: Hybrid Tables Now Generally Available

Webinars

Is Apache Iceberg the New Hadoop? Navigating the Complexities of Modern Data Lakehouses

Data Engineering Weekly #222

50+ Azure Data Factory Interview Questions and Answers [2025]

How to Build a Data Lake?

Simplifying Data Architecture and Security to Accelerate Value

Collecting And Retaining Contextual Metadata For Powerful And Effective Data Discovery

AWS Glue-Unleashing the Power of Serverless ETL Effortlessly

How to Build an End to End Machine Learning Pipeline?

Why Open Table Format Architecture is Essential for Modern Data Systems

How To Build A Batch Data Pipeline?

Data Engineering Weekly #213

10+ Top Data Pipeline Tools to Streamline Your Data Journey

Manufacturing Data Ingestion into Snowflake

How To Use Airbyte, dbt-teradata, Dagster, and Teradata Vantage™ for Seamless Data Integration

20 Best Open Source Big Data Projects to Contribute on GitHub

How to Learn AWS for Data Engineering?

Tame The Entropy In Your Data Stack And Prevent Failures With Sifflet

50 PySpark Interview Questions and Answers For 2025

Data Engineering Zoomcamp – Data Ingestion (Week 2)

Apache Ozone Powers Data Science in CDP Private Cloud

Data Engineering Weekly #217

7 Best Data Warehousing Tools for Efficient Data Storage Needs

Microsoft Fabric Architecture Explained: Core Components & Benefit

How to Build RAG Pipelines for LLM Projects?

Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and…

What is Apache Iceberg: Features, Architecture & Use Cases

The Rise of the Data Engineer

What is Retrieval Augmented Generation (RAG) Architecture?

5 AWS Glue Use Cases and Examples That Showcase Its Power

Improved Ascend for Databricks, New Lineage Visualization, and Better Incremental Data Ingestion

Simplify Data Security For Sensitive Information With The Skyflow Data Privacy Vault

The Ultimate 101 Guide to Apache Airflow DAGS

Top 10 Data Engineering Tools You Must Learn in 2025

The Ultimate Guide to Getting Started with AWS Athena in 2025

Snowflake’s Best-in-Class Enterprise Data Foundation Unlocks Interoperability with Open Data and Internal Collaboration

Data Engineering Weekly #179

100+ Big Data Interview Questions and Answers 2025

How to learn data engineering

Scalable Annotation Service?—?Marken

How to Transition from ETL Developer to Data Engineer?

The Modern Data Lakehouse: An Architectural Innovation

Stay Connected