Unlock the power of scalable cloud storage with Azure Blob Storage! This Azure Blob Storage tutorial offers everything you need to know to get started with this scalable cloud storage solution. By 2030, the global cloud storage market is likely to be worth USD 490.8
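As a quick taste of what the tutorial covers, here is a minimal sketch of uploading a file with the azure-storage-blob Python SDK; the connection string, container name, and blob path are placeholders, not values from the post.

```python
# Minimal sketch: upload a local file to Azure Blob Storage.
# The connection string, container, and blob names are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<your-connection-string>")
container = service.get_container_client("tutorial-container")

with open("report.csv", "rb") as data:
    container.upload_blob(name="reports/report.csv", data=data, overwrite=True)
```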
With this public preview, those external catalog options are either “GLUE”, where Snowflake can retrieve table metadata snapshots from AWS Glue Data Catalog, or “OBJECT_STORE”, where Snowflake retrieves metadata snapshots directly from the specified cloud storage location. Now, Snowflake can make changes to the table.
Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms. In this blog, we will discuss: What is the Open Table format (OTF)? Why should we use it? A Brief History of OTF A comparative study between the major OTFs.
We can store the data and metadata in a checkpointing directory. In Spark, checkpointing may be used for the following data categories. Metadata checkpointing: Metadata means information about information. It refers to storing metadata in a fault-tolerant storage system such as HDFS. appName('ProjectPro').getOrCreate()
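To make the getOrCreate() fragment above concrete, here is a minimal PySpark sketch that sets a fault-tolerant checkpoint directory and checkpoints a DataFrame; the HDFS path is an illustrative assumption.

```python
# Minimal PySpark sketch of checkpointing to a fault-tolerant directory (e.g., HDFS).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ProjectPro').getOrCreate()

# Checkpoint data (and streaming metadata) is written under this directory.
spark.sparkContext.setCheckpointDir("hdfs:///checkpoints/projectpro")

df = spark.range(1_000_000)
df = df.checkpoint()  # truncates the lineage and persists to the checkpoint dir
print(df.count())
```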
Performance is one of the key criteria, if not the most important one, in choosing a Cloud Data Warehouse service. In this blog post, we compare Cloudera Data Warehouse (CDW) on Cloudera Data Platform (CDP) using Apache Hive-LLAP to Microsoft HDInsight (also powered by Apache Hive-LLAP) on Azure using the TPC-DS 2.9
Our previous tech blog Packaging award-winning shows with award-winning technology detailed our packaging technology deployed on the streaming side. Table 1: Movie and File Size Examples Initial Architecture A simplified view of our initial cloud video processing pipeline is illustrated in the following diagram.
To help you prepare for your data warehouse engineer interview, we have included a list of some popular Snowflake interview questions and answers in this blog. The data is organized in a columnar format in the Snowflake cloud storage. How does Snowflake store data? Is Snowflake an ETL tool? Define staging in Snowflake.
CDP One is a new service from Cloudera that is the first data lakehouse SaaS offering with cloud compute, cloud storage, machine learning (ML), streaming analytics, and enterprise-grade security built in. Secure single-tenant cloud infrastructure. The post Accelerate Analytics for All appeared first on Cloudera Blog.
Read this blog till the end to learn everything you need to know about Airflow DAGs. This blog will dive into the details of Apache Airflow DAGs, exploring how they work, with multiple examples of using Airflow DAGs for data processing and automation workflows. Apache Airflow DAGs are your one-stop solution!
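For orientation, here is a minimal illustrative DAG with two dependent tasks; the dag_id, schedule, and callables are placeholders rather than examples taken from the post.

```python
# Minimal Airflow DAG sketch: two Python tasks, run daily.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting data")

def load():
    print("loading data")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # use schedule_interval on Airflow versions before 2.4
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
```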
Customers who have chosen Google Cloud as their cloud platform can now use CDP Public Cloud to create secure governed data lakes in their own cloud accounts and deliver security, compliance and metadata management across multiple compute clusters. A provisioning Service Account with these roles assigned.
This blog post serves as a dev diary of the process, covering our challenges, contributions made and attempts to validate them. We started to consider breaking the components down into different plugins, which could be used for more than just cloud storage.
Netflix Drive relies on a data store that will be the persistent storage layer for assets, and a metadata store which will provide a relevant mapping from the file system hierarchy to the data store entities. The main interfaces, shown in Figure 2, are the file system interface, the API interface, and the metadata and data stores. A sample manifest file.
By separating the compute, the metadata, and data storage, CDW dynamically adapts to changing workloads and resource requirements, speeding up deployment while effectively managing costs – while preserving a shared access and governance model. Separate storage.
Replication Manager can be used to migrate Apache Hive, Apache Impala, and HDFS objects from CDH clusters to CDP Public Cloud clusters. This blog post outlines detailed step-by-step instructions to perform Hive Replication from an on-prem CDH cluster to a CDP Public Cloud Data Lake. This blog post is not a substitute for that.
YARN allows you to use various data processing engines for batch, interactive, and real-time stream processing of data stored in HDFS or cloud storage like S3 and ADLS. Coordinates distribution of data and metadata, also known as shards. For the examples presented in this blog, we assume you have a CDP account already.
In the previous blog posts in this series, we introduced the Netflix Media Database (NMDB) and its salient “Media Document” data model. NMDB is built to be a highly scalable, multi-tenant, media metadata system that can serve a high volume of write/read throughput as well as support near real-time queries.
This shift presents abundant career opportunities, especially in big data and cloud computing , as businesses increasingly rely on cloud technologies. Therefore, gaining hands-on experience through practical projects in cloud computing is now essential for anyone looking to excel in this field.
The Ranger Authorization Service (RAZ) is a new service added to help provide fine-grained access control (FGAC) for cloud storage. We covered the value this new capability provides in a previous blog. Regardless of the storage type or location, all access is handled consistently and audited on a per-user basis. Conclusion.
Are you looking to choose the best cloud data warehouse for your next big data project? This blog presents a detailed comparison of two of the most popular cloud data warehouses - Redshift vs. BigQuery - to help you pick the right solution for your data warehousing needs. The global data warehousing market will likely reach $51.18
In terms of data analysis, as soon as the front-end visualization or BI tool starts accessing the data, the CDW Hive virtual warehouse will spin up cloud computing resources to combine the persisted historical data from the cloud storage with the latest incremental data from Kafka into a transparent real-time view for the users.
This blog comprehensively overviews Amazon Rekognition's features , use cases, architecture, pricing, projects, etc. Additionally, there's a separate charge for storing face metadata objects necessary for face and user search functionalities. Face metadata storage for face search incurs monthly charges.
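To show where that stored face metadata comes from, here is a hedged boto3 sketch of indexing a face into a collection; the region, bucket, key, and collection names are placeholders, and the collection is assumed to already exist.

```python
# Illustrative sketch: index a face so its metadata can later be searched.
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

response = rekognition.index_faces(
    CollectionId="my-face-collection",   # assumed to exist (create_collection)
    Image={"S3Object": {"Bucket": "my-bucket", "Name": "photos/person.jpg"}},
    MaxFaces=1,
)
# The face metadata stored per indexed face is what incurs the monthly charge.
print(response["FaceRecords"][0]["Face"]["FaceId"])
```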
We recently completed a project with IMAX, where we learned that they had developed a way to simplify and optimize the process of integrating Google Cloud Storage (GCS) with Bazel. In this blog post, we’ll dive into the features, installation, and usage of rules_gcs , and how it provides you with access to private resources.
Foundational to the data fabric are metadata-driven pipelines for scalability and resiliency, a unified view of the data from source through to the data products, and the ability to operate across a hybrid, multi-cloud environment. The post Demystifying Modern Data Platforms appeared first on Cloudera Blog.
In the case of CDP Public Cloud, this includes virtual networking constructs and the data lake as provided by a combination of a Cloudera Shared Data Experience (SDX) and the underlying cloud storage. Further auditing can be enabled at a session level so administrators can request key metadata about each CML process.
It serves as a foundation for the entire data management strategy and consists of multiple components including data pipelines; on-premises and cloud storage facilities such as data lakes, data warehouses, and data hubs; data streaming and Big Data analytics solutions (Hadoop, Spark, Kafka, etc.);
The architecture is three layered: Database Storage: Snowflake has a mechanism to reorganize the data into its internal optimized, compressed and columnar format and stores this optimized data in cloud storage. Snowflake allows the loading of both structured and semi-structured datasets from cloud storage.
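As a hedged sketch of that loading path, the snippet below uses the Snowflake Python connector to run a COPY INTO from a cloud-storage stage; the account, credentials, warehouse, stage, and table names are placeholders.

```python
# Hedged sketch: load staged CSV files from cloud storage into a Snowflake table.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="COMPUTE_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)
conn.cursor().execute(
    "COPY INTO events FROM @my_cloud_stage "
    "FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '\"')"
)
conn.close()
```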
Read this blog to know how various data-specific roles, such as data engineer, data scientist, etc., Refining and enhancing local and metadata models. Thinking of making a career transition from ETL developer to a data engineer role? billion to USD 87.37 billion in 2025.
Multi-Cloud Management. Single-cloud visibility with Cloudera Manager. Single-cloud visibility with Ambari. Policy-Driven Cloud Storage Permissions. The post The value of CDP Public Cloud over legacy Hadoop-on-IaaS implementations appeared first on Cloudera Blog. Workload Management. Not available.
DataHub 0.8.36 – Metadata management is a big and complicated topic. DataHub is a completely independent product by LinkedIn, and the folks there definitely know what metadata is and how important it is. If you haven’t found your perfect metadata management system just yet, maybe it’s time to try DataHub!
One is data at rest, for example in a data lake, warehouse, or cloud storage. From there they can do analytics on this data, and that is predominantly around what has already happened or around how to prevent something from happening in the future.
This activity is critical for migrating data, extending cloud and on-premises deployments, and getting data ready for analytics. In this all-encompassing tutorial blog, we are going to give a detailed explanation of the Copy activity with special attention to data stores, file types, and options. can be ingested in Azure.
To prevent the management of these keys (which can run into the millions) from becoming a performance bottleneck, the encrypted data encryption key is stored in the file metadata itself. Each file will have an EDEK which is stored in the file’s metadata. However, we can continue without enabling TLS for the purpose of this blog.
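For context, the sketch below shows one hedged way to set up an HDFS encryption zone (driven from Python for consistency with the other examples); the key name and path are placeholders, and a configured Hadoop KMS plus an HDFS client on the host are assumed.

```python
# Hedged sketch: create an HDFS encryption zone so files written under it
# each get their own EDEK stored in the file's metadata.
import subprocess

subprocess.run(["hadoop", "key", "create", "projectKey"], check=True)
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/secure/zone"], check=True)
subprocess.run(
    ["hdfs", "crypto", "-createZone", "-keyName", "projectKey", "-path", "/secure/zone"],
    check=True,
)
# Any file written under /secure/zone now carries an EDEK in its metadata.
```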
popular SQL and NoSQL database management systems including Oracle, SQL Server, Postgres, MySQL, MongoDB, Cassandra, and more; cloud storage services such as Amazon S3, Azure Blob, and Google Cloud Storage; message brokers such as ActiveMQ, IBM MQ, and RabbitMQ; Big Data processing systems like Hadoop; and more. ZooKeeper issue.
If you want to follow along and execute all the commands included in this blog post (and the next), you can check out this GitHub repository , which also includes the necessary Docker Compose functionality for running a compatible KSQL and Confluent Platform environment using the recently released Confluent 5.2.1. Sample repository.
Because Altus Data Warehouse uses open source formats and the data resides in your cloud storage rather than in a proprietary data store, there is no concern about vendor lock-in. Altus Data Warehouse is not like other cloud data warehouses. The post Altus Data Warehouse appeared first on Cloudera Blog.
Before Confluent Cloud was announced , a managed service for Apache Kafka did not exist. This blog post goes over: The complexities that users will run into when self-managing Apache Kafka on the cloud and how users can benefit from building event streaming applications with a fully managed service for Apache Kafka.
In this blog post, I will explain the underlying technical challenges and share the solution that we helped implement at kaiko.ai, a MedTech startup in Amsterdam that is building a Data Platform to support AI research in hospitals. A solution is to read the bytes that we need when we need them directly from Blob Storage.
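A minimal sketch of that ranged-read idea, using the azure-storage-blob SDK, is shown below; the connection string, container, blob name, and byte range are illustrative assumptions, not details from the post.

```python
# Hedged sketch: fetch only the byte range we need directly from Blob Storage.
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    "<your-connection-string>", container_name="slides", blob_name="scan_0001.tiff"
)
# Download bytes 0..4095 only (e.g., a header region), not the whole blob.
header = blob.download_blob(offset=0, length=4096).readall()
print(len(header))
```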
And, out of these professions, this blog will discuss the data engineering job role. Then, the Yelp dataset downloaded in JSON format is connected to Cloud SDK, followed by connections to Cloud Storage, which is then connected with Cloud Composer. The Yelp dataset JSON stream is published to the PubSub topic.
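The publish step can be sketched with the google-cloud-pubsub client as follows; the project ID, topic name, and sample record are placeholders rather than values from the project.

```python
# Illustrative sketch: publish one JSON record (e.g., a Yelp review) to Pub/Sub.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "yelp-reviews")

record = {"business_id": "abc123", "stars": 4}
future = publisher.publish(topic_path, data=json.dumps(record).encode("utf-8"))
print(future.result())  # message ID once the publish succeeds
```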
From the Airflow side: a client has 100 data pipelines running via a cron job in a GCP (Google Cloud Platform) virtual machine, every day at 8am. In a Google Cloud Storage bucket. This is the same sensibility expressed in the dbt viewpoint in 2016, the closest thing to a founding blog post as exists for dbt.
The CDC system then periodically polls the source file system to check for any new files using the file metadata it stored earlier as a reference. Any new files are then captured and their metadata stored too. Along with the data, the path of the file and the source system it was captured from is also stored.
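A conceptual sketch of that polling loop is below; the state-file layout, directory path, and helper names are hypothetical and only illustrate comparing current file metadata against what was captured on the previous poll.

```python
# Conceptual sketch: poll a source directory and capture files not seen before.
import json
import os
import time

STATE_FILE = "captured_files.json"
SOURCE_DIR = "/data/incoming"

def load_state():
    return json.load(open(STATE_FILE)) if os.path.exists(STATE_FILE) else {}

def poll_once():
    state = load_state()
    for name in os.listdir(SOURCE_DIR):
        path = os.path.join(SOURCE_DIR, name)
        meta = os.stat(path)
        if path not in state:  # new file -> capture it and record its metadata
            state[path] = {
                "size": meta.st_size,
                "mtime": meta.st_mtime,
                "source": "local-fs",
                "captured_at": time.time(),
            }
            print(f"captured {path}")
    json.dump(state, open(STATE_FILE, "w"))

poll_once()
```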
AWS S3 Interview Questions and Answers: Amazon S3 (Simple Storage Service) is a secure, scalable, and durable cloud storage solution designed for a wide range of use cases, from backup and archiving to big data analytics. Storage: Amazon S3 with intelligent tiering for video storage.
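As a hedged illustration of that intelligent-tiering answer, the boto3 sketch below uploads a video with the INTELLIGENT_TIERING storage class; the bucket, key, and file names are placeholders.

```python
# Hedged sketch: upload a video to S3 using the INTELLIGENT_TIERING storage class.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="episode_001.mp4",
    Bucket="my-video-archive",
    Key="videos/episode_001.mp4",
    ExtraArgs={"StorageClass": "INTELLIGENT_TIERING"},
)
```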
Whether you are a data engineer looking to streamline your workflow or a curious beginner ready to dive in, this blog has got you covered. Hooks act as the building blocks for operators, keeping sensitive connection data secure within Airflow's encrypted metadata database. Check out this detailed blog on Apache Airflow.
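A hedged sketch of a hook in use is shown below: the hook pulls credentials from an Airflow connection rather than hard-coding them; the conn_id and query are placeholders.

```python
# Hedged sketch: use a hook inside a task callable so credentials come from
# the Airflow connection store, not from the code itself.
from airflow.providers.postgres.hooks.postgres import PostgresHook

def count_orders():
    hook = PostgresHook(postgres_conn_id="my_postgres")  # placeholder conn_id
    rows = hook.get_records("SELECT COUNT(*) FROM orders")
    print(rows[0][0])
```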
With the separation of compute and storage, CDW engines leverage newer techniques such as compute-only scaling and efficient caching of shared data. These techniques range from distributing concurrent query for overall throughput to metadata caching, data caching, and results caching. 2,300 / month for the cloud hardware costs.