This continues a series of posts on the topic of efficient ingestion of data from the cloud (see here, here, and here). Before we get started, let's be clear: when using cloud storage, it is usually not recommended to work with files that are particularly large; reading them efficiently means making full use of the available resources (CPU cores and TCP connections).
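Where a large file is unavoidable, the standard remedy is to split the object into byte ranges and fetch them in parallel, so that multiple CPU cores and TCP connections stay busy. A minimal sketch of that pattern follows; the function names are illustrative (not from the linked posts), and an in-memory blob stands in for ranged GETs against real object storage:

```python
from concurrent.futures import ThreadPoolExecutor

def byte_ranges(total_size: int, chunk_size: int):
    """Yield (start, end) pairs covering [0, total_size), end exclusive."""
    for start in range(0, total_size, chunk_size):
        yield start, min(start + chunk_size, total_size)

def fetch_range(blob: bytes, start: int, end: int) -> bytes:
    # Stand-in for an HTTP GET with a Range header against object storage.
    return blob[start:end]

def parallel_read(blob: bytes, chunk_size: int = 8, workers: int = 4) -> bytes:
    """Read a blob in fixed-size chunks on a thread pool, preserving order."""
    ranges = list(byte_ranges(len(blob), chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: fetch_range(blob, *r), ranges)
        return b"".join(parts)
```

A real downloader would issue one `Range: bytes=start-end` request per chunk; the reassembly order is guaranteed because `ThreadPoolExecutor.map` yields results in submission order.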
Our latest blog dives into enabling security for Uber's modernized batch data lake on Google Cloud Storage! Ready to boost your Hadoop data lake security on GCP?
Shared Data Experience (SDX) on Cloudera Data Platform (CDP) enables centralized data access control and audit for workloads in the Enterprise Data Cloud. The public cloud (CDP-PC) editions default to using cloud storage (S3 for AWS, ADLS Gen2 for Azure). For more details, see the following resources.
Powered by Apache HBase and Apache Phoenix, COD ships out of the box with Cloudera Data Platform (CDP) in the public cloud. It's also multi-cloud ready to meet your business where it is today, whether AWS, Microsoft Azure, or GCP. We tested two cloud storage services, AWS S3 and Azure ABFS.
This blog post serves as a dev diary of the process, covering our challenges, contributions made, and attempts to validate them. We started to consider breaking the components down into different plugins, which could be used for more than just cloud storage.
Rather than streaming data from the source into cloud object stores and then copying it to Snowflake, data is ingested directly into a Snowflake table to reduce architectural complexity and end-to-end latency. And if you are using Amazon Managed Streaming for Apache Kafka (MSK), you can get started using this guided demo.
This blog dives into the remarkable journey of a data team that achieved unparalleled efficiency using DataOps principles and software that transformed their analytics and data teams into a hyper-efficient powerhouse. They opted for Snowflake, a cloud-native data platform ideal for SQL-based analysis.
Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms (e.g., Amazon S3, Azure Data Lake, or Google Cloud Storage). In this blog, we will discuss: What is the Open Table Format (OTF)? Why should we use it?
Read here a list of the most common and persistent challenges in cloud computing that are yet to be addressed in this evolving domain. Top 10 Safest Cloud Storage. Although affordable and easy to use, cloud storage does pose a lot of challenges in terms of security and safety.
With this public preview, those external catalog options are either "GLUE", where Snowflake can retrieve table metadata snapshots from AWS Glue Data Catalog, or "OBJECT_STORE", where Snowflake retrieves metadata snapshots directly from the specified cloud storage location. With these options, which one should you use?
While cloud computing is pushing the boundaries of science and innovation into a new realm, it is also laying the foundation for a new wave of business startups. 5 Reasons Your Startup Should Switch to Cloud Storage Immediately. 1) Cost-effective: Probably the strongest argument in the cloud's favor is the cost-effectiveness that it offers.
Our previous tech blog, Packaging award-winning shows with award-winning technology, detailed our packaging technology deployed on the streaming side. From chunk encoding to assembly and packaging, the result of each processing step must be uploaded to cloud storage and then downloaded by the next processing step.
Performance is one of the key, if not the most important deciding criterion, in choosing a Cloud Data Warehouse service. In this blog post, we compare Cloudera Data Warehouse (CDW) on Cloudera Data Platform (CDP) using Apache Hive-LLAP to Microsoft HDInsight (also powered by Apache Hive-LLAP) on Azure using the TPC-DS 2.9
By storing data in its native state in cloud storage solutions such as AWS S3, Google Cloud Storage, or Azure ADLS, the Bronze layer preserves the full fidelity of the data. This foundational layer is a repository for various data types, from transaction logs and sensor data to social media feeds and system logs.
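To make the Bronze-layer idea concrete, here is a hedged sketch of what a raw, full-fidelity write might look like. The helper name and path layout are invented for illustration, and a local directory stands in for S3, Google Cloud Storage, or ADLS:

```python
import json
from datetime import date
from pathlib import Path

def bronze_write(root: Path, source: str, records: list[dict]) -> Path:
    """Append raw records, untransformed, under <source>/ingest_date=YYYY-MM-DD/."""
    partition = root / source / f"ingest_date={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    out = partition / "part-0000.jsonl"
    with out.open("a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")  # no cleaning, no schema enforcement
    return out
```

The point is what the function does not do: no parsing, deduplication, or schema enforcement; those belong to the Silver and Gold layers downstream.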
For example, you can create a custom cluster today that includes both NiFi and Spark; this will allow you to use the extensive library of NiFi processors to easily ingest data into Google Cloud Storage and use Spark for processing and preparing the data for analytics, all in one cluster.
This blog lays out some steps to help you incrementally advance efforts to be a more data-driven, customer-centric organization. Cloudera refers to this as universal data distribution, as explored further in this blog post. In some cases, firms are surprised by cloud storage costs and are looking to repatriate data.
About this Blog. Data Discovery and Exploration (DDE) was recently released in tech preview in Cloudera Data Platform in the public cloud. In this blog we will go through the process of indexing data from S3 into Solr in DDE with the help of NiFi in Data Flow (rather than Spark as the ingest pipeline tool for Search). Assumptions.
$2,300 / month for the cloud hardware costs. $150 / month for the cloud storage. $5,394 / month for storage access costs (scan, read, write). $700 / month for security and metadata service. $90 / month for monitoring services. This totals $11,734 for the cloud provider's house offering vs. $2,968 from CDW!
On May 3, 2023, Cloudera kicked off a contest called "Best in Flow" for NiFi developers to compete to build the best data pipelines. This blog is to congratulate our winner and review the top submissions. RK built some simple flows to pull streaming data into Google Cloud Storage and Snowflake. Congratulations Vince!
YARN allows you to use various data processing engines for batch, interactive, and real-time stream processing of data stored in HDFS or cloud storage like S3 and ADLS. For the examples presented in this blog, we assume you have a CDP account already. In this example: s3a://dde-bucket.
Cloudera Data Platform 7.2.1 introduces fine-grained authorization for access to Azure Data Lake Storage using Apache Ranger policies. Cloudera and Microsoft have been working together closely on this integration, which greatly simplifies the security administration of access to ADLS Gen2 cloud storage. What's next?
The Ranger Authorization Service (RAZ) is a new service added to help provide fine-grained access control (FGAC) for cloud storage. We covered the value this new capability provides in a previous blog. Please take a look at this use case blog to see how these use cases are supported in CDP Public Cloud deployments.
Replication Manager can be used to migrate Apache Hive, Apache Impala, and HDFS objects from CDH clusters to CDP Public Cloud clusters. This blog post outlines detailed step-by-step instructions to perform Hive replication from an on-prem CDH cluster to a CDP Public Cloud Data Lake. This blog post is not a substitute for that.
What are the cases where it makes sense to use MinIO in place of a cloud-native object store such as S3 or Google Cloud Storage? What do you have planned for the future of MinIO?
A new capability called Ranger Authorization Service (RAZ) provides fine-grained authorization on cloud storage. Along with delivering the world's first true hybrid data cloud, stay tuned for product announcements that will drive even more business value with innovative data ops and engineering capabilities.
Step 1: Separate Compute and Storage. One of the ways we first extended RocksDB to run in the cloud was by building RocksDB Cloud, in which the SST files created upon a memtable flush are also backed up into cloud storage such as Amazon S3.
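The separation can be sketched as a flush hook that mirrors each newly written SST file to a bucket. This is an illustrative toy, not the actual RocksDB Cloud code; the class and file layout are invented for the example, and a local directory stands in for Amazon S3:

```python
import shutil
from pathlib import Path

class CloudBackedStore:
    """Toy KV store: memtable in RAM, 'SST' files on flush, mirrored to a bucket."""

    def __init__(self, local_dir: Path, bucket_dir: Path):
        self.local_dir = local_dir
        self.bucket_dir = bucket_dir
        self.memtable: dict[str, str] = {}
        self.flush_count = 0

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value  # writes land in memory first

    def flush(self) -> Path:
        """Persist the memtable as a sorted SST-like file, then mirror it."""
        self.flush_count += 1
        self.local_dir.mkdir(parents=True, exist_ok=True)
        sst = self.local_dir / f"{self.flush_count:06d}.sst"
        sst.write_text("\n".join(f"{k}\t{v}" for k, v in sorted(self.memtable.items())))
        self.memtable.clear()
        # The key step: the cloud copy, not the local disk, becomes the
        # durable state, so compute nodes are disposable.
        self.bucket_dir.mkdir(parents=True, exist_ok=True)
        return Path(shutil.copy2(sst, self.bucket_dir / sst.name))
```

Because the durable copy lives in the bucket, a replacement machine only needs to pull the SST files back down to resume serving, which is exactly the restart behavior compute/storage separation is after.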
Striim customers often utilize a single streaming source for delivery into Kafka, cloud data warehouses, and cloud storage, simultaneously and in real time. Building data pipelines and working with streaming data should not require custom coding.
In a recent blog, Cloudera Chief Technology Officer Ram Venkatesh described the evolution of a data lakehouse, as well as the benefits of using an open data lakehouse, especially the open Cloudera Data Platform (CDP). Modern data lakehouses are typically deployed in the cloud. If you missed it, you can read up about it here.
*For clarity, the scope of the current certification covers CDP-Private Cloud Base. Certification of CDP-Private Cloud Experiences will be considered in the future. The certification process is designed to validate Cloudera products on a variety of Cloud, Storage & Compute Platforms.
File systems can store small datasets, while computer clusters or cloud storage keeps larger datasets. The designer must decide on and understand the data storage and the inter-relation of data elements. It offers various blogs based on the above-mentioned technologies in alphabetical order.
Separate storage. Cloudera's Data Warehouse service allows raw data to be stored in the cloud storage of your choice (S3, ADLS Gen2). It will be stored in your own namespace, and will not force you to move data into someone else's proprietary file formats or hosted storage. Get your data in place (e.g., an S3 bucket).
But working with cloud storage has often been a compromise. Enterprises started moving to the cloud expecting infinite scalability and simultaneous cost savings, but the reality has often turned out to be more nuanced. The introduction of ADLS Gen1 was exciting because it was cloud storage that behaved like HDFS.
We recently completed a project with IMAX, where we learned that they had developed a way to simplify and optimize the process of integrating Google Cloud Storage (GCS) with Bazel. In this blog post, we'll dive into the features, installation, and usage of rules_gcs, and how it provides you with access to private resources.
GitHub writes an excellent blog capturing the current state of the LLM integration architecture. The blog is a great read to understand late-arriving data, backfilling, and incremental processing complications. I experienced similar drawbacks to what Lyft is talking about in Druid.
In terms of data analysis, as soon as the front-end visualization or BI tool starts accessing the data, the CDW Hive virtual warehouse will spin up cloud computing resources to combine the persisted historical data from cloud storage with the latest incremental data from Kafka into a transparent real-time view for the users.
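The combining step can be sketched as a latest-wins overlay: incremental records override the persisted snapshot by key. This is a hedged illustration of the general idea, not CDW's actual implementation:

```python
# Overlay the latest incremental records (e.g., from Kafka) on top of a
# persisted historical snapshot (e.g., from cloud storage), latest-wins per key.
def realtime_view(historical: list[dict], incremental: list[dict],
                  key: str = "id") -> list[dict]:
    merged: dict = {}
    for rec in historical:   # older, persisted snapshot first
        merged[rec[key]] = rec
    for rec in incremental:  # newer records override by key
        merged[rec[key]] = rec
    return list(merged.values())
```

In the warehouse this merge happens transparently at query time, so consumers see one up-to-date table rather than two sources.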
Uber: Enabling Security for Hadoop Data Lake on Google Cloud Storage. Uber writes about securing a Hadoop-based data lake on Google Cloud Platform (GCP) by replacing HDFS with Google Cloud Storage (GCS) while maintaining existing security models like Kerberos-based authentication.
In this blog, we’ll share how CDP Operational Database can deliver high performance for your applications when running on AWS S3. CDP Operational Database allows developers to use Amazon Simple Storage Service (S3) as its main persistence layer for saving table data.
The data may be in various file formats within cloud storage, but the data lakehouse delivers it as a virtual relational data warehouse for consumption. The post Demystifying Modern Data Platforms appeared first on Cloudera Blog. Ramsey International Modern Data Platform Architecture.
Managing Big Data in Clusters and Cloud Storage teaches how to manage big datasets, how to load them into clusters and cloud storage, and how to apply structure to the data so you can run queries on it using distributed SQL engines. Glynn Durham is a Senior Instructor at Cloudera.
In this blog, we’ll whisk you away on an enchanting journey through DBT materializations. In this blog, we will cover: What is DBT? On the other hand, external table materializations allow us to store the pre-calculated results in external systems, such as cloud storage.
In this blog post, we will show how Snowflake’s integrated functionality simplifies building and deploying RAG-based applications. Organizations can get started quickly by pointing the PARSE_DOCUMENT SQL function to process PDF documents available in a cloud storage service accessible via an External Stage (e.g.,
On restart on a new machine, the same files and folders will be prefetched from the cloud. We will cover the different namespaces of Netflix Drive in more detail in a subsequent blog post. Finally, once the encoded copy is prepared, this copy can be persisted by Netflix Drive to a persistent storage tier in the cloud.
To finish the year, the Airflow team has released improvements to Datasets and a major step forward with the new Object Storage API, which provides a generic abstraction over cloud storage to transfer data from one store to another. Other reads: The state of SQL-based observability, on the ClickHouse blog.