Architecture and Data Storage - Data Engineering Digest

They Handle 500B Events Daily. Here’s Their Data Engineering Architecture.

Monte Carlo

NOVEMBER 12, 2024

A data engineering architecture is the structural framework that determines how data flows through an organization – from collection and storage to processing and analysis. It’s the big blueprint we data engineers follow in order to transform raw data into valuable insights.

Architecture

Architecture Data Engineering Data Engineer Engineering

Why Open Table Format Architecture is Essential for Modern Data Systems

phData: Data Engineering

NOVEMBER 8, 2024

The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. Though basic and easy to use, traditional table storage formats struggle to keep up. Open table formats track the data files within a table along with their column statistics.

Architecture

Architecture Systems Data Lake Google Cloud

How Apache Iceberg Is Changing the Face of Data Lakes

Snowflake

APRIL 2, 2025

Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.

Data Lake

Data Lake Cloud Storage Metadata Data Warehouse

Data Warehouses vs. Data Lakes vs. Data Marts: Need Help Deciding?

KDnuggets

OCTOBER 30, 2023

A comparative overview of data warehouses, data lakes, and data marts to help you make informed decisions on data storage solutions for your data architecture.

Data Lake

Data Lake Data Warehouse Data Storage Data

Shift Left: Headless Data Architecture, Part 1

Confluent

OCTOBER 17, 2024

A headless data architecture separates data storage, management, optimization, and access from services that write, process, and query it—creating a single point of access control.

Data Architecture

Data Architecture Architecture Data Storage Data

On-Premise vs Cloud: Where Does the Future of Data Storage Lie?

Monte Carlo

AUGUST 15, 2023

These can sometimes involve running parallel architectures (analytical batches and real-time streams) and trying to reach a level of quality control that is not possible to the degree most would like. Challenges still exist of course.

Data Storage

Data Storage Cloud Metadata Machine Learning

Open-Source Data Warehousing – Druid, Apache Airflow & Superset

Simon Späti

NOVEMBER 28, 2018

In my recent blog post, I researched OLAP technologies; for this post, I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system. I went with Apache Druid for data storage, Apache Superset for querying and Apache Airflow as a task orchestrator.

Data Warehouse

Data Warehouse Data Storage Data Architecture Architecture

How to Use Kafka for Event Streaming in a Microservices Architecture?

Workfall

JUNE 27, 2023

This means there is a high risk of data loss, but Apache Kafka addresses it: because Kafka is distributed and scales horizontally, other servers can take over the workload seamlessly. Kafka can also be used to stream data from IoT devices or sensors.

Kafka

Kafka Architecture AWS Transportation

8 Essential Data Pipeline Design Patterns You Should Know

Monte Carlo

NOVEMBER 21, 2024

Whether it’s customer transactions, IoT sensor readings, or just an endless stream of social media hot takes, you need a reliable way to get that data from point A to point B while doing something clever with it along the way. That’s where data pipeline design patterns come in.

Data Pipeline

Data Pipeline Designing Lambda Architecture Kafka

How to Design a Modern, Robust Data Ingestion Architecture

Monte Carlo

MAY 28, 2024

A data ingestion architecture is the technical blueprint that ensures that every pulse of your organization’s data ecosystem brings critical information to where it’s needed most. Data Storage: Store validated data in a structured format, facilitating easy access for analysis.

Data Ingestion

Data Ingestion Architecture Designing Hadoop

What is Azure architecture?

Knowledge Hut

MARCH 14, 2024

Azure architecture includes all the ideas and elements needed to build a safe, dependable, and scalable cloud application. The resources are distributed across multiple data centers and global areas, adhering to a distributed paradigm.

Architecture

Architecture Cloud Computing Utilities Machine Learning

What is an AI Data Engineer? 4 Important Skills, Responsibilities, & Tools

Monte Carlo

OCTOBER 31, 2024

Key Differences Between AI Data Engineers and Traditional Data Engineers: While traditional data engineers and AI data engineers have similar responsibilities, they ultimately differ in where they focus their efforts. Data Storage Solutions: As we all know, data can be stored in a variety of ways.

Data Engineering

Data Engineering Data Engineer Engineering Unstructured Data

Top 10 Data Engineering Trends in 2025

Edureka

APRIL 22, 2025

It’s challenging to integrate data from various sources and manage massive data pipelines while preserving high performance and dependability. The difficulty is in creating scalable and resilient architectures. Advanced Data Visualisation Tools: More people will want data visualization tools that are easier to use.

Data Engineering

Data Engineering Data Engineer Engineering Consulting

Five Ways A Modern Data Architecture Can Reduce Costs in Telco

Cloudera

JUNE 27, 2023

The way to achieve this balance is by moving to a modern data architecture (MDA) that makes it easier to manage, integrate, and govern large volumes of distributed data. When you deploy a platform that supports MDA you can consolidate other systems, like legacy data mediation and disparate data storage solutions.

Data Architecture

Data Architecture Architecture Government Data Governance

DataOps Architecture: 5 Key Components and How to Get Started

Databand.ai

AUGUST 30, 2023

By Ryan Yackel. What Is DataOps Architecture? DataOps is a collaborative approach to data management that combines the agility of DevOps with the power of data analytics. As a result, they can be slow, inefficient, and prone to errors.

Architecture

Architecture Data Ingestion Data Governance Data Cleanse

Data Engineering Weekly #210

Data Engineering Weekly

MARCH 2, 2025

Sneha Ghantasala: Slow Reads for S3 Files in Pandas & How to Optimize it. DeepSeek’s Fire-Flyer File System (3FS) re-emphasizes the importance of an optimized file system for efficient data processing.

Data Engineering

Data Engineering Data Engineer Engineering Datasets

Data Engineering Weekly #206

Data Engineering Weekly

FEBRUARY 2, 2025

Adam Bellemare & Thomas Betts: The End of the Bronze Age: Rethinking the Medallion Architecture. I’m always a bit uncomfortable with medallion architecture since it is a glorified term for the traditional ETL process.

Data Engineering

Data Engineering Data Engineer Engineering Data Lake

Data Lakehouse Architecture Explained: 5 Layers

Monte Carlo

JANUARY 5, 2024

You know what they always say: data lakehouse architecture is like an onion. Data lakehouse architecture combines the benefits of data warehouses and data lakes, bringing together the structure and performance of a data warehouse with the flexibility of a data lake.

Architecture

Architecture Data Lake Metadata Unstructured Data

5 Layers of Data Lakehouse Architecture Explained

Monte Carlo

JANUARY 5, 2024

You know what they always say: data lakehouse architecture is like an onion. Data lakehouse architecture combines the benefits of data warehouses and data lakes, bringing together the structure and performance of a data warehouse with the flexibility of a data lake.

Architecture

Architecture Data Lake Metadata Unstructured Data

Turbocharging Atlas: How we reduced server initialization time to less than 2 minutes

ThoughtSpot

NOVEMBER 5, 2024

It stores all the metadata created within a ThoughtSpot instance to enable efficient querying, retrieval, and management of data objects. While Atlas operates as an in-memory graph database for speed and performance, it uses PostgreSQL as its persistent storage layer to ensure durability and long-term data storage.

Metadata

Metadata PostgreSQL Java Database

Data Pipeline Architecture Explained: 6 Diagrams and Best Practices

Monte Carlo

JUNE 14, 2023

In this post, we will help you quickly level up your overall knowledge of data pipeline architecture by reviewing what data pipeline architecture is and why it is important.

Data Pipeline

Data Pipeline Architecture Data Lake Data Warehouse

Data Lake Explained: A Comprehensive Guide to Its Architecture and Use Cases

AltexSoft

AUGUST 29, 2023

In 2010, a transformative concept took root in the realm of data storage and analytics: the data lake. The term was coined by James Dixon, Back-End Java, Data, and Business Intelligence Engineer, and it started a new era in how organizations could store, manage, and analyze their data. What is a data lake?

Data Lake

Data Lake Architecture IT Amazon Web Services

What is Data Architecture? Types, Components and Benefits

Hevo

DECEMBER 15, 2024

Introduction to Data Architecture: Data architecture shows how data is managed, from collection to transformation to distribution and consumption. It describes how data flows through the data storage systems. Data architecture is an important piece of data management.

Data Architecture

Data Architecture Architecture Data Storage Data

Harness the Power of Pinecone with Cloudera’s New Applied Machine Learning Prototype

Cloudera

NOVEMBER 1, 2023

And so we are thrilled to introduce our latest applied ML prototype (AMP) — a large language model (LLM) chatbot customized with website data using Meta’s Llama2 LLM and Pinecone’s vector database. An overview of the RAG architecture with a vector database used to minimize hallucinations in the chatbot application.

Machine Learning

Machine Learning Data Ingestion Database Architecture

Data Mesh Architecture: Revolutionizing Event Streaming with Striim

Striim

NOVEMBER 8, 2023

Data Mesh is revolutionizing event streaming architecture by enabling organizations to quickly and easily integrate real-time data, streaming analytics, and more. In this article, we will explore the advantages and limitations of data mesh, while also providing best practices for building and optimizing a data mesh with Striim.

Architecture

Architecture Generalist Government Datasets

Building Meta’s GenAI Infrastructure

Engineering at Meta

MARCH 12, 2024

Through careful co-design of the network, software, and model architectures, we have successfully used both RoCE and InfiniBand clusters for large, GenAI workloads (including our ongoing training of Llama 3 on our RoCE cluster) without any network bottlenecks.

Building

Building Portfolio Utilities Data Storage

Making messaging interoperability with third parties safe for users in Europe

Engineering at Meta

MARCH 6, 2024

We think the best way to deliver interoperability is through a solution which builds on Meta’s existing client/server architecture [Figure 1]. The proof is constructed by the third-party service cryptographically signing an authentication token.

Media

Media Architecture Metadata Data Storage

5 Advantages of Real-Time ETL for Snowflake

Striim

MARCH 21, 2025

In-flight data processing reduces the time needed for data preparation as it delivers the data in a consumable form.

Data Warehouse

Data Warehouse MongoDB MySQL Hadoop

UK Government: From cloud first to cloud appropriate?

Cloudera

OCTOBER 1, 2020

Since 2013 the UK Government’s flagship ‘Cloud First’ policy has been at the forefront of enabling departments to shed their legacy IT architecture in order to meaningfully embrace digital transformation. Whilst two of the big three have UK data centres – what happens if they go down?

Government

Government Cloud Data Storage Architecture

How Meta trains large language models at scale

Engineering at Meta

JUNE 12, 2024

Storage: We need efficient data-storage solutions to store the vast amounts of data used in model training. This involves investing in high-capacity and high-speed storage technologies and developing new data-storage solutions for specific workloads.

Algorithm

Algorithm Data Storage Technology Building

A Guide to Data Pipelines (And How to Design One From Scratch)

Striim

SEPTEMBER 11, 2024

Data pipelines are the backbone of your business’s data architecture. Implementing a robust and scalable pipeline ensures you can effectively manage, analyze, and organize your growing data. Benjamin Kennedy, Cloud Solutions Architect at Striim, emphasizes the outcome-driven nature of data pipelines.

Data Pipeline

Data Pipeline Designing Data Lake Data Warehouse

Thoughts on Amazon Express One and its impact in Data Infrastructure

Data Engineering Weekly

DECEMBER 2, 2023

The Current State of the Data Architecture: S3 intelligent tiered storage provides a fine balance between cost and the duration of data retention. However, real-time insight into recent data remains a big challenge. The combination of stream processing + OLAP storage like Pinot.

IT

IT BI AWS Kafka

Big Data Technologies that Everyone Should Know in 2024

Knowledge Hut

APRIL 25, 2024

Each of these technologies has its own strengths and weaknesses, but all of them can be used to gain insights from large data sets. As organizations continue to generate more and more data, big data technologies will become increasingly essential. Let's explore the technologies available for big data.

Big Data

Big Data Technology Hadoop NoSQL

On-Prem vs. The Cloud: Key Considerations

phData: Data Engineering

FEBRUARY 21, 2025

Prior to making a decision, an organization must consider the Total Cost of Ownership (TCO) for each potential data warehousing solution. On the other hand, cloud data warehouses can scale seamlessly. Vertical scaling refers to the increase in capability of existing computational resources, including CPU, RAM, or storage capacity.

Cloud

Cloud Data Warehouse Amazon Web Services Data Ingestion

We’ll See You at the Gartner Data and Analytics Summit

Cloudera

MAY 9, 2024

Hybrid Horses for Courses: The Right Cloud for AI from Pilot to Production at Scale. Later, on May 14 at 12:40 pm BST, hear from Mark Samson, one of Cloudera’s solutions engineering directors, on whether a data center or cloud deployment is best for your organization’s data platform and architecture.

Banking

Banking Data Storage Data Analytics Cloud

Snowflake and the Pursuit Of Precision Medicine

Snowflake

NOVEMBER 29, 2023

For example, the data storage systems and processing pipelines that capture information from genomic sequencing instruments are very different from those that capture the clinical characteristics of a patient from a site. A conceptual architecture illustrating this is shown in Figure 3.

Metadata

Metadata Healthcare Medical Data Storage

Introducing Polaris Catalog: An Open Source Catalog for Apache Iceberg

Snowflake

JUNE 3, 2024

Either way, they want the freedom to safely use multiple engines on a single copy of data to minimize the storage and compute costs associated with moving data or maintaining multiple copies. Catalogs play a critical role in a multi-engine architecture.

Amazon Web Services

Amazon Web Services Google Cloud Data Architect Government

Examining Flights in the U.S. with AWS and Power BI

Towards Data Science

JULY 5, 2023

Contents: Introduction; Problem Statement; Data; AWS Architecture; Data Storage with AWS S3; Designing the Schema; ETL with AWS Glue; Data Warehousing with AWS Redshift; Extracting Insights…

AWS

AWS BI Data Storage Architecture

Setting The Stage For The Next Chapter Of The Cassandra Database

Data Engineering Podcast

SEPTEMBER 12, 2021

What are some of the tools and system architectures that users turn to when building analytical workloads for data stored in Cassandra? The architecture of Cassandra has lent itself well to the cloud native ecosystem that has been growing in recent years. Cassandra is primarily used as a system of record.

Database

Database Kafka Metadata Data Storage

How to Navigate the Costs of Legacy SIEMS with Snowflake

Snowflake

APRIL 18, 2024

Legacy SIEM cost factors to keep in mind. Data ingestion: Traditional SIEMs often impose limits on data ingestion and data retention. Snowflake allows security teams to store all their data in a single platform and maintain it all in a readily accessible state, with virtually unlimited cloud data storage capacity.

Data Lake

Data Lake Data Ingestion Bytes Cloud Computing

Hands-On Introduction to Delta Lake with (py)Spark

Towards Data Science

FEBRUARY 15, 2023

Concepts, theory, and functionalities of this modern data storage framework. I think the value data can have is now perfectly clear to everybody. To use a hyped example, models like ChatGPT could only be built on a huge mountain of data, produced and collected over years.

Data Lake

Data Lake Data Warehouse Hadoop Architecture

Exploring The TileDB Universal Data Engine

Data Engineering Podcast

AUGUST 17, 2020

He also discusses the shifts in database architectures from vertically integrated monoliths to separately deployed layers, and the approach he is taking with TileDB cloud to embed the authorization into the storage engine, while providing a flexible interface for compute. How is the built in data versioning implemented?

Data Engineering

Data Engineering Data Engineer Engineering Database Design

Data News — Week 23.03

Christophe Blefari

JANUARY 20, 2023

There is an introduction post about DataHub. When you look at what you have to run to launch a data catalog, it is 4 components and 4 different data stores. Don't be surprised if no one uses data catalogs. And to think that some people say Airflow is complex to launch.

Google Cloud

Google Cloud Data Hadoop Machine Learning

How To Future-Proof Your Data Pipelines

Ascend.io

NOVEMBER 14, 2024

Scalability: How To Build Scalable Pipelines. Scalability is a fundamental aspect of future-proofing your data pipelines. As data volumes grow, pipelines must efficiently handle increased loads without compromising performance.

Data Pipeline

Data Pipeline Amazon Web Services Data Integration Data
