The Race For Data Quality In A Medallion Architecture The Medallion architecture is gaining traction among data teams. It is a design pattern that helps data teams organize data processing and storage into three distinct layers, often called Bronze, Silver, and Gold.
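The three layers can be illustrated with a minimal, purely illustrative Python sketch; the layer functions, field names, and sample records below are hypothetical, not part of any specific implementation:

```python
# Minimal illustration of Medallion layers: Bronze holds raw records,
# Silver holds cleaned/validated records, Gold holds business aggregates.

def to_silver(bronze_rows):
    """Clean raw rows: drop records missing an amount, normalize types."""
    return [
        {"customer": r["customer"].strip().lower(), "amount": float(r["amount"])}
        for r in bronze_rows
        if r.get("amount") not in (None, "")
    ]

def to_gold(silver_rows):
    """Aggregate cleaned rows into a per-customer total."""
    totals = {}
    for r in silver_rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

bronze = [
    {"customer": " Alice ", "amount": "10.5"},
    {"customer": "bob", "amount": "2"},
    {"customer": "alice", "amount": None},   # bad record, filtered out in Silver
    {"customer": "alice", "amount": "4.5"},
]
gold = to_gold(to_silver(bronze))
print(gold)  # {'alice': 15.0, 'bob': 2.0}
```

The point of the pattern is that each layer has a contract: Bronze preserves raw input untouched, Silver enforces quality rules, and Gold serves consumption-ready aggregates.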
Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Iceberg tables become interoperable while maintaining ACID compliance by adding a layer of metadata to the data files in a user's object storage.
Though basic and easy to use, traditional table storage formats struggle to keep up. Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms. In this blog, we will discuss: What is the Open Table Format (OTF)?
Shared Data Experience (SDX) on Cloudera Data Platform (CDP) enables centralized data access control and audit for workloads in the Enterprise Data Cloud. The public cloud (CDP-PC) editions default to using cloud storage (S3 for AWS, ADLS Gen2 for Azure).
And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. What are the technical/architectural/UX challenges that have hindered the progression of lakehouses? Want to see Starburst in action?
Apache Iceberg’s ecosystem of diverse adopters, contributors and commercial support continues to grow, establishing itself as the industry standard table format for an open data lakehouse architecture. Snowflake’s support for Iceberg Tables is now in public preview, helping customers build and integrate Snowflake into their lake architecture.
Whether you use Snowpipe Streaming as a standalone client or as part of your Kafka architecture, you can create scalable and reliable data pipelines with a fully managed underlying infrastructure with built-in observability.
Table 1: Movie and File Size Examples Initial Architecture A simplified view of our initial cloud video processing pipeline is illustrated in the following diagram. Figure 1: A Simplified Video Processing Pipeline With this architecture, chunk encoding is very efficient and processed in distributed cloud computing instances.
Every database built for real-time analytics has a fundamental limitation. When you deconstruct the core database architecture, deep in its heart you will find a single component performing two distinct, competing functions: real-time data ingestion and query serving. — Michael Carey.
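The contention between those two functions can be sketched generically in Python; this is an illustration of the general pattern, not any specific database's design, and the class and field names are hypothetical:

```python
import threading

# Illustration: when one component serves both ingestion and queries,
# the write path and the read path contend on the same shared state.

class SharedStore:
    def __init__(self):
        self._rows = []
        self._lock = threading.Lock()

    def ingest(self, row):
        """Write path: appends under the shared lock."""
        with self._lock:
            self._rows.append(row)

    def query(self, predicate):
        """Read path: scans under the SAME lock, so heavy ingest
        directly delays queries (and vice versa)."""
        with self._lock:
            return [r for r in self._rows if predicate(r)]

store = SharedStore()
for i in range(5):
    store.ingest({"id": i, "value": i * 10})
print(store.query(lambda r: r["value"] >= 30))
```

Architectures that separate compute for ingestion from compute for queries (as the Rockset snippet below describes) remove exactly this coupling.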
Between continuous real-time collection of data, and its delivery to enterprise and cloud destinations, data has to move in a reliable and scalable way. There are architectural and technology decisions every step of the way – not just at design time, but also at run time.
Azure is among the top cloud service providers. Azure architecture includes all the ideas and elements needed to build a safe, dependable, and scalable cloud application. What Is Microsoft Azure Cloud Architecture? Users can view and access their files from anywhere with its cloud storage capabilities.
CDP One is a new service from Cloudera that is the first data lakehouse SaaS offering with cloud compute, cloud storage, machine learning (ML), streaming analytics, and enterprise-grade security built in. What if you could access all your data and execute all your analytics in one workflow, quickly, with only a small IT team?
The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java.
CDP Public Cloud is already available on Amazon Web Services and Microsoft Azure. With the addition of Google Cloud, we deliver on our vision of providing a hybrid and multi-cloud architecture to support our customers' analytics needs regardless of deployment platform.
Rockset introduces a new architecture that enables separate virtual instances to isolate streaming ingestion from queries and one application from another. Benefits of Compute-Compute Separation In this new architecture, virtual instances contain the compute and memory needed for streaming ingest and queries.
By separating the compute, the metadata, and data storage, CDW dynamically adapts to changing workloads and resource requirements, speeding up deployment while effectively managing costs – while preserving a shared access and governance model. Architecture overview. Separate storage. Get your data in place.
The architecture is designed to be resilient against new-age attacks on LLMs, such as prompt injection and prompt leaks. Architecture: Let's start with the big picture and tackle how we adjusted our cloud architecture with additional internal and external interfaces to integrate the LLM.
Modern data platforms deliver an elastic, flexible, and cost-effective environment for analytic applications by leveraging a hybrid, multi-cloud architecture to support data fabric, data mesh, data lakehouse and, most recently, data observability. The high-level architecture shown below forms the backdrop for the exploration.
This blog will give you in-depth knowledge of what a data pipeline is and also explore other aspects such as data pipeline architecture, data pipeline tools, use cases, and much more. Features of a Data Pipeline Data Pipeline Architecture How to Build an End-to-End Data Pipeline from Scratch? What is a Big Data Pipeline?
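At its simplest, a data pipeline is a sequence of stages composed in order: extract, transform, load. A minimal sketch (the stage functions and sample data are hypothetical):

```python
# A data pipeline as composed stages: extract -> transform -> load.

def extract():
    """Pretend source: raw CSV-like lines."""
    return ["1,alice", "2,bob", "3,"]

def transform(lines):
    """Parse lines and drop malformed rows (missing name)."""
    rows = []
    for line in lines:
        uid, name = line.split(",", 1)
        if name:
            rows.append({"id": int(uid), "name": name})
    return rows

def load(rows, sink):
    """Write rows to an in-memory 'destination'; return row count."""
    sink.extend(rows)
    return len(rows)

destination = []
loaded = load(transform(extract()), destination)
print(loaded)  # 2 rows survived the transform step
```

Real pipelines add the concerns the article lists — orchestration, retries, and observability around each stage — but the stage composition is the core shape.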
Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. What are the cases where it makes sense to use MinIO in place of a cloud-native object store such as S3 or Google Cloud Storage?
Those tools include: cloud storage and compute, data transformation, business intelligence, data observability, and orchestration. And we won’t mention ogres or bean dip again. Cloud storage and compute: whether you’re stacking data tools or pancakes, you always build from the bottom up. Let’s dive into it.
A new capability called Ranger Authorization Service (RAZ) provides fine-grained authorization on cloud storage. We are excited to offer in Tech Preview this born-in-the-cloud table format that will help future-proof data architectures at many of our public cloud customers. Modernizing pipelines.
There is no question that cloud computing is here to stay, because its architecture is simple, defining its components and subcomponents in clear terms. It is ubiquitous today, offering many advantages in terms of flexibility, maintenance, sharing, and storage, among others. What Is Cloud Computing Architecture?
Finally, cloud computing adds low cost and high resiliency to these services. The advantages provide the foundation for the modern data lakehouse architectural pattern. Cloud storage is versioned as well, and should you inadvertently delete important data, the SaaS CDP One ops team can quickly recover it for you.
Moreover, the data would need to leave the cloud environment to reach our machine, which is not exactly secure and auditable. To make the cloud experience as smooth as possible, we designed a data lake architecture where data sits in simple cloud storage (AWS S3) and a serverless infrastructure embedding DuckDB serves as the query engine.
In this environment, the emphasis shifts from minimizing storage space to optimizing query performance. In BigQuery, de-normalization emerges as a preferred strategy for several reasons: Query Performance : BigQuery’s distributed architecture excels at scanning large volumes of data in parallel.
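The trade-off behind denormalization can be shown with a small Python sketch: the normalized form requires a join-time lookup on every query, while the denormalized form copies customer attributes onto each order row so a plain scan suffices. The tables and field names here are invented for illustration:

```python
# Normalized: two "tables"; the query must look up the customer per order.
orders = [{"order_id": 1, "cust_id": 10, "amount": 99.0},
          {"order_id": 2, "cust_id": 11, "amount": 5.0}]
customers = {10: {"name": "alice", "region": "EU"},
             11: {"name": "bob", "region": "US"}}

def query_normalized(region):
    return [o["amount"] for o in orders
            if customers[o["cust_id"]]["region"] == region]

# Denormalized: customer attributes copied onto each order row, trading
# extra storage for a join-free scan -- the pattern that suits engines
# optimized for parallel scans of wide tables.
orders_denorm = [{**o, **customers[o["cust_id"]]} for o in orders]

def query_denormalized(region):
    return [o["amount"] for o in orders_denorm if o["region"] == region]

print(query_normalized("EU"), query_denormalized("EU"))  # [99.0] [99.0]
```

Both queries return the same answer; the difference is where the work happens — at write time (denormalized) versus at read time (normalized).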
YARN allows you to use various data processing engines for batch, interactive, and real-time stream processing of data stored in HDFS or cloud storage like S3 and ADLS. You need to configure the backup repository in solr.xml to point to your cloud storage location (in this example, your S3 bucket). Prerequisites.
GitHub: The architecture of today’s LLM applications. LLMs are slowly changing the application architecture landscape as they become integral to app development. GitHub has written an excellent blog capturing the current state of LLM integration architecture. Visit rudderstack.com to learn more. Partitions, ever-present.
A new solution integrating cloud object storage, with Cloudera’s NiFi dataflows, a Kafka datahub, and a Hive virtual warehouse in the CDW service allows businesses to take the best advantage of this public cloud trend. The Cost-Effective Data Warehouse Architecture. This architecture has the following benefits .
A file and folder interface for Netflix Cloud Services Written by Vikram Krishnamurthy , Kishore Kasi , Abhishek Kapatkar , and Tejas Chopra In this post, we are introducing Netflix Drive, a Cloud drive for media assets and providing a high level overview of some of its features and interfaces.
We recently completed a project with IMAX, where we learned that they had developed a way to simplify and optimize the process of integrating Google Cloud Storage (GCS) with Bazel. rules_gcs is a Bazel ruleset that facilitates the downloading of files from Google Cloud Storage. What is rules_gcs?
The certification process is designed to validate Cloudera products on a variety of cloud, storage, and compute platforms. Validation includes: Overall architecture. Observance of the CDP interface classification system.
But to understand why Kafka is omnipresent we have to look at how it works — in other words, to get familiar with its concepts and architecture. Kafka architecture. Read our article on event-driven architecture and Pub/Sub to learn more about this powerful communication paradigm. Kafka cluster architecture. Scalability.
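A toy in-memory model can make Kafka's core concepts concrete: a topic is split into partitions, each partition is an append-only log addressed by offsets, and messages with the same key always land in the same partition, which preserves per-key ordering. This is an illustration of the concepts, not the real Kafka client API:

```python
# Toy model of Kafka's core concepts: topic -> partitions -> offsets.

class Topic:
    def __init__(self, num_partitions):
        # Each partition is an independent append-only log.
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Same key -> same partition, so per-key order is preserved.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def consume(self, partition, offset):
        # Consumers track their own offset and read forward from it.
        return self.partitions[partition][offset:]

topic = Topic(num_partitions=3)
p1, _ = topic.produce("user-42", "clicked")
p2, _ = topic.produce("user-42", "purchased")
assert p1 == p2                  # same key landed in the same partition
print(topic.consume(p1, 0))      # ['clicked', 'purchased'], in order
```

Scalability in Kafka comes from exactly this split: partitions can live on different brokers and be consumed in parallel, while ordering guarantees are scoped per partition.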
Maybe you need to scale up to a cloud storage provider like Snowflake or AWS to keep up and make all this data accessible at the pace you need. You probably need to attend to data architecture to try and keep costs from skyrocketing, but what about data retention? This isn’t sustainable, though — not forever anyway.
This blog post outlines detailed step-by-step instructions to perform Hive replication from an on-prem CDH cluster to a CDP Public Cloud Data Lake. Architecture. In order to copy or migrate data from the CDH cluster to the CDP Data Lake cluster, the on-prem CDH cluster must be able to access the CDP cloud storage.
To provide accurate answers, developers can use a RAG-based architecture, where the LLM retrieves relevant internal knowledge from documents, wikis or FAQs before generating a response. Since a pre-trained LLM alone will lack deep expertise in your company’s products, the answers generated are likely to be incorrect and of no value.
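The retrieval step of a RAG architecture can be sketched in a few lines of Python. This is a deliberately simplified illustration — it scores documents by keyword overlap, where production systems use embedding similarity — and the documents and question are invented:

```python
# Minimal sketch of RAG retrieval: pick internal documents relevant to
# the question, then prepend them as context to the LLM prompt.

docs = [
    "refunds are processed within 5 business days",
    "our premium plan includes priority support",
    "password resets are handled via the account settings page",
]

def retrieve(question, k=1):
    """Rank documents by word overlap with the question (toy scoring)."""
    q_words = set(question.lower().split())
    return sorted(docs,
                  key=lambda d: len(q_words & set(d.split())),
                  reverse=True)[:k]

def build_prompt(question):
    """Assemble the augmented prompt sent to the LLM."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt("how fast are refunds processed")
print("refunds are processed" in prompt)  # True: relevant doc was retrieved
```

The generation step (calling the LLM with this prompt) is omitted; the key architectural idea is that retrieval grounds the model in internal knowledge it was never trained on.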
A lot of cloud-based data warehouses are available in the market today; among them, let us focus on Snowflake. Built on a new SQL database engine, it provides a unique architecture designed for the cloud. Snowflake's architecture provides flexibility with big data. Here’s a detailed look at the architecture of Snowflake.
To get a better understanding of a data architect’s role, let’s clear up what data architecture is. Data architecture is the organization and design of how data is collected, transformed, integrated, stored, and used by a company. Sample of a high-level data architecture blueprint for Azure BI programs.
The serving and monitoring infrastructure needs to fit into your overall enterprise architecture and tool stack. Say you wanted to build one integration pipeline from MQTT to Kafka with KSQL for data preprocessing, and use Kafka Connect for data ingestion into HDFS, AWS S3, or Google Cloud Storage, where you do the model training.
However, the hybrid cloud is not going away anytime soon. In fact, the hybrid cloud will likely become even more common as businesses move more of their workloads to the cloud. So what will be the future of cloudstorage and security? With guidance from industry experts, be ready for a future in the domain.
Data storage is a vital aspect of any Snowflake Data Cloud database. Within Snowflake, data can either be stored locally or accessed from other cloud storage systems. In Snowflake, there are three different storage layers available: Database, Stage, and Cloud Storage.
Implementing a Modern Data Architecture. With this expanded scope, the organization has introduced its Cloud Storage Connector, which has become a fully integrated component for data access and processing of Hadoop and Spark workloads.
File systems can store small datasets, while computer clusters or cloud storage keep larger ones. The designer must decide on and understand the data storage and the interrelation of data elements. GitHub repository: a place to find detailed code and architecture designs.