Blog and Data Storage - Data Engineering Digest

Open-Source Data Warehousing – Druid, Apache Airflow & Superset

Simon Späti

NOVEMBER 28, 2018

However, this is still not common in the Data Warehouse (DWH) field. In my recent blog, I researched OLAP technologies; for this post, I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system. Why is this?

Data Warehouse

Data Warehouse Data Storage Data Architecture Architecture

Data Engineering Weekly #210

Data Engineering Weekly

MARCH 2, 2025

I found the blog to be a fresh take on the skill in demand by layoff datasets. DeepSeek’s smallpond Takes on Big Data. DeepSeek continues to impact the Data and AI landscape with its recent open-source tools, such as Fire-Flyer File System (3FS) and smallpond. Mehdio: DuckDB goes distributed?

Data Engineering

Data Engineering Data Engineer Engineering Datasets

Telco 5G Returns Will Come from Enterprise Data Solutions

Cloudera

APRIL 22, 2022

This blog post was written by Dean Bubley, industry analyst, as a guest author for Cloudera. The focus has also been hugely centred on compute rather than data storage and analysis. But there may be a large gap between when “compute” occurs, compared to when data is collected and how it is stored.

Data Solutions

Data Solutions Amazon Web Services Data Storage Google Cloud

Top 10 Data Engineering Trends in 2025

Edureka

APRIL 22, 2025

Data engineering can help with it. It is the force behind seamless data flow, enabling everything from AI-driven automation to real-time analytics. To stay competitive, businesses need to adapt to new trends and find new ways to deal with ongoing problems by taking advantage of new possibilities in data engineering.

Data Engineering

Data Engineering Data Engineer Engineering Consulting

Why Open Table Format Architecture is Essential for Modern Data Systems

phData: Data Engineering

NOVEMBER 8, 2024

Though basic and easy to use, traditional table storage formats struggle to keep up. Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms. In this blog, we will discuss: What is the Open Table format (OTF)?

Architecture

Architecture Systems Data Lake Google Cloud

Building a Machine Learning Application With Cloudera Data Science Workbench And Operational Database, Part 1: The Set-Up & Basics

Cloudera

JANUARY 6, 2021

Python is used extensively among Data Engineers and Data Scientists to solve all sorts of problems from ETL/ELT pipelines to building machine learning models. Apache HBase is an effective data storage system for many workflows but accessing this data specifically through Python can be a struggle. Put Operations.

Machine Learning

Machine Learning Data Science Database Building

5 Advantages of Real-Time ETL for Snowflake

Striim

MARCH 21, 2025

This blog post describes the advantages of real-time ETL and how it increases the value gained from Snowflake implementations. With instant elasticity, high performance, and secure data sharing across multiple clouds, Snowflake has become highly in demand for its cloud-based data warehouse offering.

Data Warehouse

Data Warehouse MongoDB MySQL Hadoop

Data News — Week 23.03

Christophe Blefari

JANUARY 20, 2023

Summer is coming (credits). Hey, new Friday, new Data News edition. Thank you for every recommendation you make about the blog or the Data News. There is an introduction post about DataHub: when you look at what you have to run to launch a data catalog, it takes 4 components and 4 different data stores.

Google Cloud

Google Cloud Data Hadoop Machine Learning

Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8

Data Engineering Podcast

NOVEMBER 22, 2017

To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. This is your host Tobias Macey, and today I’m interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems.

Hadoop

Hadoop Data Storage Data Pipeline Data Engineer

Schema Evolution with Case Sensitivity Handling in Snowflake

Cloudyard

JANUARY 21, 2025

However, one of the most common challenges faced by data engineers is the evolution of schemas as new data comes in. Schema evolution refers to the ability of a system to adapt to changes in the structure of incoming data without breaking existing workflows.

Data Schemas

Data Schemas Data Pipeline Data Warehouse Data Storage
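
The Cloudyard excerpt above describes schema evolution as adapting to structural changes in incoming data without breaking existing workflows. As a rough illustration only, the Python sketch below matches incoming field names to existing columns case-insensitively and appends fields that are new. The function name `evolve_schema` and the choice to store new columns in upper case are assumptions made for this example (Snowflake folds unquoted identifiers to upper case, which motivates the convention); this is not the method from the linked post and it does not call any Snowflake API.

```python
def evolve_schema(existing_columns, incoming_record):
    """Return (updated_columns, normalized_record).

    Incoming keys are matched to existing columns case-insensitively.
    Keys with no match are treated as new columns and appended.
    """
    # Map the upper-cased form of each known column to its stored name.
    canonical = {name.upper(): name for name in existing_columns}
    updated = list(existing_columns)
    normalized = {}

    for key, value in incoming_record.items():
        folded = key.upper()
        if folded not in canonical:
            # Assumption for this sketch: new columns are stored upper-cased.
            canonical[folded] = folded
            updated.append(folded)
        normalized[canonical[folded]] = value

    return updated, normalized


if __name__ == "__main__":
    columns, record = evolve_schema(
        ["ID", "NAME"], {"id": 1, "Name": "a", "email": "x@y"}
    )
    print(columns)
    print(record)
```

Folding names before comparison means that `id`, `Id`, and `ID` all land in one column instead of creating near-duplicates, which is the case-sensitivity problem the title refers to.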

UK Government: From cloud first to cloud appropriate?

Cloudera

OCTOBER 1, 2020

Such a status has yet to be granted, and without it, data transfers between the UK and the EU will not be lawfully permitted after December 31st, 2020. Without an agreed legislative route to allow data storage and processing in the US and EU, the UK Government will be left with one option: storage and processing within the UK only.

Government

Government Cloud Data Storage Architecture

Data Engineering Weekly #175

Data Engineering Weekly

JUNE 10, 2024

I will write a separate blog on these announcements after the Databricks conference; in the meantime, I found the blog from Cube Research, a balanced article about Snowflake Summit. Open AI: Model Spec LLM models are slowly emerging as the intelligent data storage layer. Will they co-exist or fight with each other?

Data Engineering

Data Engineering Data Engineer Engineering Kafka

We’ll See You at the Gartner Data and Analytics Summit

Cloudera

MAY 9, 2024

Hybrid Horses for Courses: The Right Cloud for AI from Pilot to Production at Scale. Later, on May 14 at 12:40 pm BST, hear from Mark Samson, one of Cloudera’s solutions engineering directors, on whether a data center or cloud deployment is best for your organization’s data platform and architecture.

Banking

Banking Data Storage Data Analytics Cloud

Harness the Power of Pinecone with Cloudera’s New Applied Machine Learning Prototype

Cloudera

NOVEMBER 1, 2023

Managing the data that represents organizational knowledge is easy for any developer and does not require exhaustive cycles of data science work. Utilizing Pinecone for vector data storage over an in-house open-source vector store can be a prudent choice for organizations.

Machine Learning

Machine Learning Data Ingestion Database Architecture

Training Foundation Improvements for Closeup Recommendation Ranker

Pinterest Engineering

SEPTEMBER 26, 2023

We have published a detailed blog post of its modeling architecture. While it is blessed with an abundance of data for training, it is also crucial to maintain a high data storage efficiency. To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site.

Software Engineering

Software Engineering Software Engineer Machine Learning Datasets

How To Future-Proof Your Data Pipelines

Ascend.io

NOVEMBER 14, 2024

By focusing on these attributes, data engineers can build pipelines that not only meet current demands but are also prepared for future challenges. In this blog post, we’ll explore key strategies for future-proofing your data pipelines. We’ll explore scalability, integration, security, and cost management.

Data Pipeline

Data Pipeline Amazon Web Services Data Integration Data

Does Cost Reduction Play a Role in Digital Transformation?

Cloudera

OCTOBER 6, 2022

CIO blog post: “Digital transformation is a foundational change in how an organization delivers value to its customers.” We see this consistently in the data platform/data storage space. Replacing redundant data storage is a clear opportunity in this category.

Data Lake

Data Lake Machine Learning Data Storage Cloud Computing

AWS Shared Responsibility Model – Amazon Web Services

Edureka

APRIL 22, 2025

Under this framework, AWS guarantees the security of the cloud, encompassing physical infrastructure, networking, and virtualization layers, while customers safeguard their workloads, data, and configurations in the cloud. This segregation of duties streamlines operations, empowering organizations to innovate without compromising security.

Amazon Web Services

Amazon Web Services AWS Cloud Data Governance

A Closer Look at The Next Phase of Cloudera’s Hybrid Data Lakehouse

Cloudera

MARCH 5, 2024

The powerful platform data security and governance layer, Shared Data Experience (SDX), is a fundamental part of the open data lakehouse, in the data center just as it is in the cloud. Learn more about the next generation of Cloudera Data Platform for Private Cloud.

Data Lake

Data Lake Data Storage Government Kafka

Big Data Technologies that Everyone Should Know in 2024

Knowledge Hut

APRIL 25, 2024

It is especially true in the world of big data. If you want to stay ahead of the curve, you need to be aware of the top big data technologies that will be popular in 2024. In this blog post, we will discuss such technologies. Let's explore the technologies available for big data.

Big Data

Big Data Technology Hadoop NoSQL

Introducing Polaris Catalog: An Open Source Catalog for Apache Iceberg

Snowflake

JUNE 3, 2024

The remainder of this blog post provides more detail on functionality and hosting options. You can host it in Snowflake managed infrastructure or your infrastructure of choice. Polaris Catalog will be both open sourced in the next 90 days and available to run in public preview in Snowflake infrastructure soon.

Amazon Web Services

Amazon Web Services Google Cloud Data Architect Government

Improving Efficiency Of Goku Time Series Database at Pinterest (Part 1)

Pinterest Engineering

NOVEMBER 22, 2023

In the first blog, we will share a short summary on the GokuS and GokuL architecture, data format for Goku Long Term, and how we improved the bootstrap time for our storage and serving components. Goku Long Term Storage Architecture Summary and Challenges Figure 9: Flow of data from GokuS to GokuL.

Database

Database Bytes Kafka Architecture

What Is the Difference Between a Database and a Warehouse in Snowflake? | Propel Data Analytics Blog

Propel Data

JULY 27, 2022

Snowflake uses databases for data storage, while a “Snowflake warehouse” is a virtual computing cluster that processes analytical queries.

Database

Database Data Analytics Data Storage Data

Top 7 Mobile Security Threats and Prevention

Edureka

MARCH 20, 2025

Their methods enabled them to intercept sensitive information like personal messages, banking details, and other confidential data. In this blog, we’ll dive into the top 7 mobile security threats that are putting both personal and organizational data at risk and explore effective strategies to defend against these dangers.

Banking

Banking Entertainment Media Transportation

Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44

Data Engineering Podcast

AUGUST 19, 2018

There are a few ways that graph structures and properties can be implemented, including the ability to store data in the edges connecting nodes and the structures that can be contained within the nodes themselves. How do the query interface and data storage in DGraph differ from other options?

Database

Database PostgreSQL NoSQL Transportation

Streaming Analytics in the Real World

Cloudera

AUGUST 31, 2020

According to Dinesh Chandrasekhar, the Director Product Marketing at Cloudera, data decay – or deterioration – complicates an already complex ecosystem defined by the exponential explosion of data from streaming sources such as IoT. The intelligence revolution driven by fast data is already well underway.

Insurance

Insurance Manufacturing Retail Banking

Top 10 Data Science Websites to learn More

Knowledge Hut

FEBRUARY 29, 2024

File systems can store small datasets, while computer clusters or cloud storage keep larger datasets. Organizing data according to a database model is known as database design. The designer must understand and decide on the data storage and the interrelation of data elements.

Data Science

Data Science Datasets Machine Learning Database Design

Data – the Octane Accelerating Intelligent Connected Vehicles

Cloudera

FEBRUARY 8, 2021

Cloudera is proud to provide the underlying data management fabric to the solution – everything from reliably moving connected vehicle data to the Cloud, to providing large scale data storage, processing, analytics and machine learning – the foundations of real-time insights and in-vehicle decision making.”

Manufacturing

Manufacturing Machine Learning Data Ingestion Electronics

Databook: Turning Big Data into Knowledge with Metadata at Uber

Uber Engineering

AUGUST 3, 2018

Data powers Uber’s global marketplace, enabling more reliable and seamless user experiences across our products for riders, … The post Databook: Turning Big Data into Knowledge with Metadata at Uber appeared first on Uber Engineering Blog.

Metadata

Metadata Big Data Transportation Data

Mastering Day 2 Operations with Cloudera

Cloudera

FEBRUARY 1, 2024

The other half of the equation requires your team’s emphasis to shift to sustained excellence in managing and optimizing your data ecosystem — better known as Day 2 operations. In this blog, we’ll cover the highlights of our recently published Day 2 Operations Guide and why it matters to enterprises.

Cloud

Cloud Architecture Utilities Designing

How to learn data engineering

Christophe Blefari

JANUARY 20, 2024

Formats: this is a huge part of data engineering, picking the right format for your data storage. You'll also be asked to put in place a data infrastructure. It means a data warehouse, a data lake or other concepts starting with data. My advice on this point is to learn from others.

Data Engineering

Data Engineering Data Engineer Engineering Hadoop

Hybrid Data Cloud Success for State and Local Governments

Cloudera

MARCH 29, 2022

This is especially crucial to state and local government IT teams, who must balance their vital missions against resource constraints, compliance requirements, cybersecurity risks, and ever-increasing volumes of data. The post Hybrid Data Cloud Success for State and Local Governments appeared first on Cloudera Blog.

Government

Government Cloud Cloud Computing Data Science

Fraud Prevention – 3 Data Strategies for Financial Services

Cloudera

NOVEMBER 18, 2020

A shared, scalable data store that spans the enterprise enables a holistic approach. A converged data approach enables more comprehensive analysis while reducing duplication of data storage. It can be used by third-party platforms, analysts, data scientists and the lines of business.

Banking

Banking Machine Learning Electronics Data

Unify your data: AI and Analytics in an Open Lakehouse

Cloudera

MAY 30, 2024

This scalability ensures the data lakehouse remains responsive and performant, even as data complexity and usage patterns change over time. Learn more about the Cloudera Open Data Lakehouse here. The post Unify your data: AI and Analytics in an Open Lakehouse appeared first on Cloudera Blog.

Data Lake

Data Lake Data Warehouse Programming Language Data Ingestion

Databricks, Snowflake and the future

Christophe Blefari

JUNE 21, 2024

Both companies have added Data and AI to their slogan, Snowflake used to be The Data Cloud and now they're The AI Data Cloud. I won't delve into every announcement here, but for more details, SELECT has written a blog covering the 28 announcements and takeaways from the Summit.

Metadata

Metadata Data Warehouse BI MySQL

96 Percent of Businesses Can’t Be Wrong: How Hybrid Cloud Came to Dominate the Data Sector

Cloudera

JANUARY 26, 2022

Network operating systems let computers communicate with each other; and data storage grew—a 5MB hard drive was considered limitless in 1983 (when compared to a magnetic drum with memory capacity of 10 kB from the 1960s). The amount of data being collected grew, and the first data warehouses were developed.

Cloud

Cloud Cloud Computing Hadoop Data Warehouse

How to Navigate the Costs of Legacy SIEMS with Snowflake

Snowflake

APRIL 18, 2024

This blog post explores how Snowflake can help with this challenge. Legacy SIEM cost factors to keep in mind Data ingestion: Traditional SIEMs often impose limits to data ingestion and data retention. Now there are a few ways to ingest data into Snowflake. But what if security teams didn’t have to make tradeoffs?

Data Lake

Data Lake Data Ingestion Bytes Cloud Computing

When Private Cloud is the Right Fit for Public Sector Missions

Cloudera

NOVEMBER 1, 2022

Translation: Government agencies, especially those under the Department of Defense (DoD), have use cases that require data storage and analytic workloads to be maintained on premises to retain absolute control of data security, privacy, and cost predictability. Learn more about CDP Private Cloud here.

Cloud

Cloud Government Cloud Computing Data Architecture

Data Impact Award Spotlight and Update on 2020’s Industry Transformation Winner: Telkomsel

Cloudera

AUGUST 27, 2021

With more than 25TB of data ingested from over 200 different sources, Telkomsel recognized that to best serve its customers it had to get to grips with its data. Its initial step in the pursuit of a digital-first strategy saw it turn to Cloudera for a more agile and cost-effective data storage infrastructure.

Telecommunication

Telecommunication Transportation Big Data Data Ingestion

What is ELT (Extract, Load, Transform)? A Beginner’s Guide [SQ]

Databand.ai

JULY 19, 2023

ELT offers a solution to this challenge by allowing companies to extract data from various sources, load it into a central location, and then transform it for analysis. The ELT process relies heavily on the power and scalability of modern data storage systems. The data is loaded as-is, without any transformation.

Data Cleanse

Data Cleanse Data Storage Raw Data Data Warehouse
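
The Databand.ai excerpt above outlines the ELT order of operations: extract from sources, load the data as-is into a central store, then transform it there. The Python sketch below walks through that order using the standard-library `sqlite3` module as a stand-in for the central store. The table names `raw_orders` and `customer_totals`, the function name `run_elt`, and the sample rows are all hypothetical, chosen for illustration; a real warehouse would replace SQLite here.

```python
import sqlite3


def run_elt(source_rows):
    """Load raw (customer, amount) text rows untouched, then transform in SQL."""
    conn = sqlite3.connect(":memory:")

    # Load: store the extracted rows exactly as received, as text.
    conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", source_rows)

    # Transform: cleaning and aggregation run inside the store, after loading.
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT LOWER(TRIM(customer)) AS customer,
               SUM(CAST(TRIM(amount) AS REAL)) AS total
        FROM raw_orders
        GROUP BY LOWER(TRIM(customer))
        """
    )

    rows = conn.execute(
        "SELECT customer, total FROM customer_totals ORDER BY customer"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    # Extract: in this sketch the "source" is just an in-memory list.
    extracted = [(" Alice", "10.5"), ("alice ", "4.5"), ("Bob", " 3 ")]
    print(run_elt(extracted))
```

Keeping the raw table untouched is the point of the pattern: the messy values stay available, so the transformation can be rewritten and re-run later without extracting from the source again.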

Apache Ozone – A High Performance Object Store for CDP Private Cloud

Cloudera

OCTOBER 15, 2021

In this blog post, we will look into benchmark test results measuring the performance of Apache Hadoop Teragen and a directory/file rename operation with Apache Ozone (native o3fs) vs. Ozone S3 API*. The post Apache Ozone – A High Performance Object Store for CDP Private Cloud appeared first on Cloudera Blog. ZooKeeper 3.5.5

Cloud

Cloud Hadoop Data Analytics Metadata

A Flexible and Efficient Storage System for Diverse Workloads

Cloudera

SEPTEMBER 15, 2022

Structured data (such as name, date, ID, and so on) will be stored in regular SQL databases like Hive or Impala databases. There are also newer AI/ML applications that need data storage, optimized for unstructured data using developer friendly paradigms like Python Boto API.

Systems

Systems Hadoop Metadata Telecommunication

How to Use Kafka for Event Streaming in a Microservices Architecture?

Workfall

JUNE 27, 2023

This means there is a high risk of data loss, but Apache Kafka addresses this because it is distributed and scales horizontally, so other servers can take over the workload seamlessly. Kafka can also be used to stream data from IoT devices or sensors. We will come up with more such use cases in our upcoming blogs.

Kafka

Kafka Architecture AWS Transportation

Getting Started with Cloudera Data Platform Operational Database (COD)

Cloudera

NOVEMBER 23, 2021

HBase is a column-oriented data storage architecture that is formed on top of HDFS to overcome its limitations. The post Getting Started with Cloudera Data Platform Operational Database (COD) appeared first on Cloudera Blog. Build and run the applications. Apache HBase.

Database

Database Non-relational Database NoSQL Government
