Introduction HDFS (Hadoop Distributed File System) is not a traditional database but a distributed file system designed to store and process big data. It provides high-throughput access to data and is optimized for […] The post A Dive into the Basics of Big Data Storage with HDFS appeared first on Analytics Vidhya.
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. Though basic and easy to use, traditional table storage formats struggle to keep up. Modern table formats track data files within the table along with their column statistics.
Introduction The Hadoop Distributed File System (HDFS) is a Java-based file system that is distributed, scalable, and portable. Due to its lack of POSIX conformance, some consider it data storage rather than a true file system.
When you click on a show in Netflix, you’re setting off a chain of data-driven processes behind the scenes to create a personalized and smooth viewing experience. As soon as you click, data about your choice flows into a global Kafka queue, which Flink then uses to help power Netflix’s recommendation engine.
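The click-to-recommendation flow described above can be illustrated with a tiny in-memory sketch. A standard-library queue stands in for the global Kafka topic, and a consumer function stands in for the Flink aggregation job; the event fields and function names are illustrative, not Netflix's actual APIs.

```python
from queue import Queue
from collections import Counter

# A plain queue standing in for a Kafka topic of click events.
events = Queue()

def record_click(user_id: str, show: str) -> None:
    """Producer side: publish a click event to the queue."""
    events.put({"user": user_id, "show": show})

def aggregate_clicks() -> Counter:
    """Consumer side (Flink stand-in): drain the queue, count clicks per show."""
    counts: Counter = Counter()
    while not events.empty():
        counts[events.get()["show"]] += 1
    return counts

record_click("u1", "Dark")
record_click("u2", "Dark")
record_click("u1", "Ozark")
print(aggregate_clicks())  # Counter({'Dark': 2, 'Ozark': 1})
```

In a real deployment the producer and consumer run in separate services, and the aggregated counts feed the recommendation model rather than a print statement.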
Structured data (such as name, date, ID, and so on) will be stored in regular SQL databases like Hive or Impala. There are also newer AI/ML applications that need data storage optimized for unstructured data, using developer-friendly paradigms like the Python Boto API.
Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.
Key parts of data systems: data flow design, data processing design, and data storage design. Introduction: If you are trying to break into (or land a new) data engineering job, you will inevitably encounter a slew of data engineering tools.
In my recent blog, I researched OLAP technologies; for this post, I chose some open-source technologies and used them together to build a full data architecture for a data warehouse system. I went with Apache Druid for data storage, Apache Superset for querying, and Apache Airflow as a task orchestrator.
Part of this emphasis extends to helping enterprises deal with their data and overall cloud connectivity as well as local networks. At the same time, operators are also becoming more data- and cloud-centric themselves. The focus has also been hugely centred on compute rather than data storage and analysis.
A streaming ETL for Snowflake approach loads data to Snowflake from diverse sources such as transactional databases, security systems logs, and IoT sensors/devices in real time, while simultaneously meeting scalability, latency, security, and reliability requirements.
But what does an AI data engineer do? AI data engineers play a critical role in developing and managing AI-powered data systems. What are they responsible for? Data storage solutions, for one: as we all know, data can be stored in a variety of ways.
ThoughtSpot prioritizes the high availability and minimal downtime of our systems to ensure a seamless user experience. In the realm of modern analytics platforms, where rapid and efficient processing of large datasets is essential, swift metadata access and management are critical for optimal system performance. What is metadata?
An information system is a broad concept that encompasses database management, communication systems, devices, network connections, the internet, and the collection, organization, and storage of data, along with other information-related applications typically used in a business setting.
Amazon Elastic File System (EFS) is a service that Amazon Web Services (AWS) provides. It is intended to deliver serverless, fully elastic file storage that enables you to share data independently of capacity and performance. All these features make it easier to safeguard your data and comply with legal requirements.
Instead of handling each piece of data as it arrives, you collect it all and process it in scheduled chunks. It’s like having a designated “laundry day” for your data. This approach is super cost-efficient because you’re not running your systems constantly. The downside?
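The "laundry day" pattern above can be sketched in a few lines: events are buffered cheaply as they arrive, and all processing cost is deferred to a scheduled flush. The class and method names are illustrative, not from any particular framework.

```python
from typing import Callable, List

class BatchProcessor:
    """Buffer events as they arrive; process them only when flush() runs."""

    def __init__(self, handler: Callable[[List[dict]], None]):
        self.buffer: List[dict] = []
        self.handler = handler

    def collect(self, event: dict) -> None:
        # Cheap append; no per-event processing cost.
        self.buffer.append(event)

    def flush(self) -> int:
        # The scheduled "laundry day": process everything in one chunk.
        self.handler(self.buffer)
        n = len(self.buffer)
        self.buffer = []
        return n

processed: list = []
bp = BatchProcessor(processed.extend)
for i in range(5):
    bp.collect({"id": i})
print(bp.flush())  # 5
```

In production, a scheduler such as cron or an orchestrator would call `flush()` at the chosen interval; the trade-off is exactly the one the excerpt names, lower cost in exchange for latency.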
When you are a data engineer, you're getting paid to build systems that people can rely on. Big data technologies are dead—bye Zookeeper 👋—but the data generated by systems is still massive. Is the modern data stack relevant to answer this need in storage and processing?
DeepSeek continues to impact the Data and AI landscape with its recent open-source tools, such as Fire-Flyer File System (3FS) and smallpond. The industry relies more or less on S3 as a de facto data storage layer, and I found the experimentation on optimizing S3 reads to be an excellent reference.
We focused on building end-to-end AI systems with a major emphasis on researcher and developer experience and productivity. Grand Teton builds on the many generations of AI systems that integrate power, control, compute, and fabric interfaces into a single chassis for better overall performance, signal integrity, and thermal performance.
Evals are introduced to evaluate LLM responses through various techniques, including self-evaluation, using another LLM as a judge, or human evaluation to ensure the system's behavior aligns with intentions. It employs a two-tower model approach to learn query and item embeddings from user engagement data.
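The two-tower idea mentioned above reduces, at serving time, to scoring query embeddings against item embeddings. Here is a minimal sketch assuming the towers have already produced vectors; the embeddings, item names, and `rank` helper are all hypothetical toy values, not a trained model.

```python
# Precomputed embeddings, standing in for the outputs of the two towers.
QUERY_EMB = {"running shoes": [0.9, 0.1, 0.0]}
ITEM_EMBS = {
    "trail sneaker": [0.8, 0.2, 0.1],
    "wool sweater": [0.1, 0.9, 0.3],
}

def dot(a, b):
    """Relevance score: inner product of query and item embeddings."""
    return sum(x * y for x, y in zip(a, b))

def rank(query: str):
    """Order items by their score against the query embedding."""
    q = QUERY_EMB[query]
    return sorted(ITEM_EMBS, key=lambda item: dot(q, ITEM_EMBS[item]), reverse=True)

print(rank("running shoes"))  # ['trail sneaker', 'wool sweater']
```

The point of the architecture is that item embeddings can be precomputed and indexed, so serving is a nearest-neighbor lookup rather than a full model forward pass per item.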
In this blog, we’ll dive into the top 7 mobile security threats that are putting both personal and organizational data at risk and explore effective strategies to defend against these dangers. Operating System and App Vulnerabilities No operating system is immune to flaws.
From his early days at Quora to leading projects at Facebook and his current venture at Fennel (a real-time feature store for ML), Nikhil has traversed the evolving landscape of machine learning engineering and machine learning infrastructure specifically in the context of recommendation systems.
To improve your data infrastructure, you should occasionally try to kill your data stack; chaos engineering is something that helps discover issues. This goes further than being a data-driven enterprise: you have to put in place a framework that puts data measurement at every product choice, resulting in increased maturity.
If you pursue the MSc big data technologies course, you will be able to specialize in topics such as Big Data Analytics, Business Analytics, Machine Learning, Hadoop and Spark technologies, Cloud Systems etc. Look for a suitable big data technologies company online to launch your career in the field.
While the modern data stack has undeniably revolutionized data management with its cloud-native approach, its complexities and limitations are becoming increasingly apparent. Agent systems powered by LLMs are already transforming how we code and interact with data. Data engineering followed a similar path.
Prior to data powering valuable data products like machine learning models and real-time marketing applications, data warehouses were mainly used to create charts in binders that sat off to the side of board meetings. For complex systems, it is the only way to identify issues early and trace them back to the root cause.
Automate Data Transformation and Orchestration: Automate data cleaning and transformation tasks using a data automation tool like Ascend to reduce manual effort and improve data consistency. API-Driven Integration Incorporating API-driven integration is also essential for future-proofing data pipelines.
Data center deployment Once we’ve chosen a GPU and system, the next task is placing them in a data center for optimal usage of resources (power, cooling, networking, etc.). Storage We need efficient data storage solutions to store the vast amounts of data used in model training.
Read Time: 6 Minute, 6 Second In modern data pipelines, handling data in various formats such as CSV, Parquet, and JSON is essential to ensure smooth data processing. However, one of the most common challenges faced by data engineers is the evolution of schemas as new data comes in.
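One common way to tolerate the schema evolution described above is to compute the union of all fields seen in a batch and backfill missing ones with nulls, so rows written before a new column existed still load cleanly. The sketch below uses plain dicts to stand in for JSON records; field names are illustrative.

```python
def unify(records: list[dict]) -> list[dict]:
    """Union-of-fields schema merge: older rows get None for new columns."""
    schema: list[str] = []
    for rec in records:
        for field in rec:
            if field not in schema:
                schema.append(field)  # preserve first-seen column order
    return [{f: rec.get(f) for f in schema} for rec in records]

batch = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b", "country": "DE"},  # a new column arrives
]
print(unify(batch)[0])  # {'id': 1, 'name': 'a', 'country': None}
```

Table formats like Parquet readers and lakehouse engines apply the same idea (schema merging) at much larger scale; this is only the core mechanic.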
Prior to making a decision, an organization must consider the Total Cost of Ownership (TCO) for each potential data warehousing solution. On the other hand, cloud data warehouses can scale seamlessly. Vertical scaling refers to the increase in capability of existing computational resources, including CPU, RAM, or storage capacity.
We implemented a batch processing system for users to submit their requests and wait for the system to generate the output. This limited pilot system greatly reduced the time spent by our users to manually analyze the content. Maintaining disparate systems posed a challenge. Processing took several hours to complete.
Here are six key components that are fundamental to building and maintaining an effective data pipeline. Data sources The first component of a modern data pipeline is the data source, which is the origin of the data your business leverages. Data storage Data storage follows.
For example, the data storage systems and processing pipelines that capture information from genomic sequencing instruments are very different from those that capture the clinical characteristics of a patient from a site. The principles emphasize machine-actionability (i.e.,
The opportunities are endless in this field — you can get a job as an operation analyst, quantitative analyst, IT systems analyst, healthcare data analyst, data analyst consultant, and many more. A Python with Data Science course is a great career investment and will pay off with great rewards in the future.
A complete view of the enterprise Now, Molex can ingest large volumes of data from customer interactions, SAP production lines, and financial transactions with Snowflake’s cloud-based platform. Data shares are secure, configurable, and controlled completely by the provider account. Access to a share can be revoked at any time.
Summary The Cassandra database is one of the first open source options for globally scalable storage systems. Since its introduction in 2008 it has been powering systems at every scale. Cassandra is primarily used as a system of record.
Point solutions are still used every day in many enterprise systems, but as IT continues to evolve, the platform approach beats point solutions in almost every use case. A few years ago, there were several choices of data deduplication apps for storage, and now, it’s a standard function in every system.
The paper discusses trade-offs among data freshness, resource cost, and query performance. Ref: [link] In the current state of the data infrastructure, we use a combination of multiple specialized data storage and processing engines to achieve this balance.
As your systems age, operational costs grow – including the cost of staffing highly specialized individuals to manage legacy technologies. For example, many organizations are now sunsetting older, more expensive systems in favor of cloud technologies that are more widely understood and easier to staff for.
If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription. TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time-stamped so you can measure how a system is changing.
Managing the data that represents organizational knowledge is easy for any developer and does not require exhaustive cycles of data science work. Utilizing Pinecone for vector data storage over an in-house open-source vector store can be a prudent choice for organizations.
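To make the in-house alternative concrete, here is a minimal sketch of what a vector store does at its core: upsert vectors by key and query by cosine similarity. This is a toy illustration of the concept, not Pinecone's API; managed services add indexing (e.g. approximate nearest neighbor), persistence, and scaling on top of this mechanic.

```python
import math

class TinyVectorStore:
    """Toy in-memory vector store: exact cosine-similarity search."""

    def __init__(self):
        self.vectors: dict[str, list[float]] = {}

    def upsert(self, key: str, vec: list[float]) -> None:
        self.vectors[key] = vec

    def query(self, vec: list[float], top_k: int = 1) -> list[str]:
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.vectors,
                        key=lambda k: cosine(vec, self.vectors[k]),
                        reverse=True)
        return ranked[:top_k]

store = TinyVectorStore()
store.upsert("doc1", [1.0, 0.0])
store.upsert("doc2", [0.0, 1.0])
print(store.query([0.9, 0.1]))  # ['doc1']
```

Exact search like this is linear in the number of vectors, which is exactly why purpose-built stores switch to approximate indexes at scale.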
As advanced use cases, like advanced driver assistance systems featuring lane change departure detection, advanced vehicle diagnostics, or predictive maintenance move forward, the existing infrastructure of the connected car is being stressed. billion in 2019, and is projected to reach $225.16 billion by 2027, registering a CAGR of 17.1%.
Python is used extensively among Data Engineers and Data Scientists to solve all sorts of problems from ETL/ELT pipelines to building machine learning models. Apache HBase is an effective data storage system for many workflows, but accessing this data specifically through Python can be a struggle.
Replace legacy: It’s hard to avoid having “legacy” systems/applications or versions since technology advancements are moving so fast these days. We see this consistently in the data platform/data storage space. Replacing redundant data storage is a clear opportunity in this category.
For data storage, it uses an object store cluster, running on VAST hardware. In this cluster, around 15 PB of raw data and 21 PB of logical data can be stored. More data can be fitted than there is raw storage available thanks to VAST’s data deduplication.
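The reason logical capacity can exceed raw capacity is content-addressed deduplication: identical chunks are stored once, keyed by their hash, while clients are still credited for every byte they wrote. A minimal sketch of the mechanic (class and field names are illustrative, not VAST's implementation):

```python
import hashlib

class DedupStore:
    """Content-addressed store: duplicate chunks cost no extra raw space."""

    def __init__(self):
        self.chunks: dict[str, bytes] = {}
        self.logical = 0  # total bytes written by clients

    def write(self, data: bytes) -> str:
        self.logical += len(data)
        digest = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(digest, data)  # keep only the first copy
        return digest

    @property
    def raw(self) -> int:
        # Bytes actually occupying storage.
        return sum(len(c) for c in self.chunks.values())

store = DedupStore()
store.write(b"block-A")
store.write(b"block-A")  # duplicate: logical grows, raw does not
store.write(b"block-B")
print(store.logical, store.raw)  # 21 14
```

The 15 PB raw / 21 PB logical figures in the excerpt reflect exactly this gap, scaled up and combined with real chunking and compression.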