Data storage has been evolving from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.
Master nodes control and coordinate two key functions of Hadoop: data storage and parallel processing of data. Worker (slave) nodes make up the majority of nodes; they store data and run computations according to instructions from a master node.
Today's platform owners, business owners, data developers, analysts, and engineers create new apps on the Cloudera Data Platform, and they must decide where and how to store that data. Structured data (such as name, date, ID, and so on) will be stored in regular SQL databases such as Hive or Impala.
Open source data lakehouse deployments are built on the foundations of compute engines (like Apache Spark, Trino, and Apache Flink), distributed storage (HDFS, cloud blob stores), and metadata catalogs/table formats (like Apache Iceberg, Delta Lake, Hudi, and the Apache Hive Metastore). Tables are governed according to agreed-upon company standards.
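As a hedged sketch of one such stack, the PySpark snippet below wires Spark (compute) to an Apache Iceberg catalog (table format) over plain file storage; the catalog name, warehouse path, and runtime package version are illustrative assumptions, not taken from the excerpt.

```python
# Minimal lakehouse stack sketch: Spark as compute engine, Iceberg as
# table format, a local directory standing in for distributed storage.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    # Iceberg's Spark runtime supplies the table-format implementation
    # (version is an assumption; match it to your Spark/Scala build).
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/warehouse")
    .getOrCreate()
)

# Tables created through the catalog carry Iceberg metadata, which is
# what lets governance standards be enforced at the table layer.
spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT) USING iceberg")
```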
This blog will guide you through the best data modeling methodologies and processes for your data lake, helping you make informed decisions and optimize your data management practices. What is a Data Lake? What are Data Modeling Methodologies, and Why Are They Important for a Data Lake?
A fundamental requirement for any lasting data system is that it should scale along with the growth of the business applications it wishes to serve. NMDB is built to be a highly scalable, multi-tenant, media metadata system that can serve a high volume of write/read throughput as well as support near real-time queries.
That's why it's essential for teams to choose the right architecture for the storage layer of their data stack. But the options for data storage are evolving quickly. So let's get to the bottom of the big question: what kind of data storage layer will provide the strongest foundation for your data platform?
Unity Catalog is Databricks' governance solution; it integrates with Databricks workspaces and provides a centralized platform for managing metadata, data access, and security. Improved data discovery: the tagging and documentation features in Unity Catalog facilitate better data discovery.
Snowflake can also ingest external tables from on-premises data sources via S3-compliant data storage APIs. Batch/file-based data is modeled into the raw vault table structures as the hub, link, and satellite tables illustrated at the beginning of this post. Enter Snowpark!
In 2010, a transformative concept took root in the realm of data storage and analytics: the data lake. The term was coined by James Dixon, a back-end Java, data, and business intelligence engineer, and it started a new era in how organizations could store, manage, and analyze their data.
Concepts, theory, and functionality of this modern data storage framework. I think it's now perfectly clear to everybody how much value data can have. To use a hyped example, models like ChatGPT could only be built on a huge mountain of data, produced and collected over years.
Parquet vs. ORC vs. Avro vs. Delta Lake. The big data world is full of various storage systems, heavily influenced by different file formats. These are key in nearly all data pipelines, allowing for efficient data storage and easier querying and information extraction.
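To make the format differences concrete, here is a minimal sketch of writing and selectively reading a columnar Parquet file; it assumes pandas with the pyarrow engine installed, and the file name is illustrative.

```python
# Columnar formats such as Parquet let readers fetch only the columns
# they need, which is what makes querying and extraction efficient.
import pandas as pd

df = pd.DataFrame({"user": ["ada", "grace"], "clicks": [10, 7]})
df.to_parquet("events.parquet", compression="snappy")  # compressed columnar file

# Read back a single column without scanning the rest of the file.
clicks = pd.read_parquet("events.parquet", columns=["clicks"])
print(clicks)
```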
A brief history of data storage: the value of data has been apparent for as long as people have been writing things down. Despite their limitations, data warehouses, introduced in the late 1980s based on ideas developed even earlier, remain in widespread use today for certain business intelligence and data analysis applications.
What is unstructured data? Definition and examples. Unstructured data, in its simplest form, refers to any data that does not have a pre-defined structure or organization. It can come in different forms, such as text documents, emails, images, videos, social media posts, sensor data, etc.
A combination of structured and semi-structured data can be used for analysis and loaded into the cloud database without needing to be transformed into a fixed relational schema first. This stage handles all aspects of data storage, such as organization, file size, structure, compression, metadata, and statistics.
In a nutshell, the lakehouse system leverages low-cost storage to keep large volumes of data in its raw formats, just like data lakes. At the same time, it brings structure to data and enables data management features similar to those in data warehouses by implementing a metadata layer on top of the store.
Data lakes are useful, flexible data storage repositories that enable many types of data to be stored in their rawest state. Traditionally, after being stored in a data lake, raw data was often moved to various destinations like a data warehouse for further processing, analysis, and consumption.
A data warehouse is a unified repository where data from diverse sources undergo aggregation and integration into a usable source of information. To achieve this, a data warehouse will require processes to gather and integrate data, manage data quality, create metadata, and support any regulatory compliance and governance procedures.
This means that a data warehouse is a collection of technologies and components that are used to store data for some strategic use. Data is collected and stored in data warehouses from multiple sources to provide insights into business data. Data from data warehouses is queried using SQL.
No matter the actual size, each cluster accommodates three functional layers: the Hadoop Distributed File System (HDFS) for data storage, Hadoop MapReduce for processing, and Hadoop YARN for resource management. If you divide a 1 GB file by the default 128 MB block size, you'll have 8 data blocks. Let's see why.
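The arithmetic behind that claim, as a tiny sketch assuming a 1 GB file and HDFS's default 128 MB block size:

```python
# Number of HDFS blocks for a file: size divided by block size, rounded up.
import math

file_size_mb = 1024   # a 1 GB file
block_size_mb = 128   # HDFS default block size
print(math.ceil(file_size_mb / block_size_mb))  # -> 8 blocks
```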
There are three steps involved in the deployment of a big data model. Data ingestion is the first: extracting data from multiple data sources. Data variety: Hadoop stores structured, semi-structured, and unstructured data.
With SQL, machine learning, real-time data streaming, graph processing, and other features, Spark enables incredibly rapid big data processing. DataFrames are used by Spark SQL to accommodate structured and semi-structured data. Calcite has chosen to stay out of the data storage and processing business.
The Snowflake data warehouse architecture has three layers: database storage, query processing, and cloud services. The database storage layer divides the data into numerous tiny partitions, optimized and compressed internally.
Data architecture: a composition of models, rules, and standards for all data systems and the interactions between them. Data catalog: an organized inventory of data assets relying on metadata to help with data management. Database: a collection of structured data.
The result of experimentation supplies downstream applications with prepared data. A data hub serves as a gateway to dispense the required data. So unstructured or semi-structured data is also available through a data hub, since a data lake can be part of it.
“Processing files in a Python UDF and stored procedure has piqued the interest of our data scientists and paves the way for automation of new, complex data pipelines.” (Mike Tuck, Air Pollution Specialist) Why unstructured data? Files read with a function or procedure are resolved with the function or procedure owner's context.
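A hedged sketch of what such a file-reading Python UDF might look like in Snowpark; the connection parameters, function name, and stage are hypothetical, while SnowflakeFile and UDF registration follow Snowpark's documented patterns.

```python
# Sketch of a Snowflake Python UDF that reads a staged file. Files
# opened via SnowflakeFile are resolved in the UDF owner's context,
# as the excerpt above describes.
from snowflake.snowpark import Session
from snowflake.snowpark.files import SnowflakeFile
from snowflake.snowpark.types import StringType

# Placeholder credentials; substitute your own connection parameters.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
}).create()

def read_first_line(file_url: str) -> str:
    # Open the staged file and return its first line.
    with SnowflakeFile.open(file_url, "r") as f:
        return f.readline()

session.udf.register(
    read_first_line,
    name="read_first_line",   # hypothetical UDF name
    return_type=StringType(),
    input_types=[StringType()],
    replace=True,
)
```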
Batch jobs are often scheduled to load data into the warehouse, while real-time data processing can be achieved using solutions like Apache Kafka and Snowpipe by Snowflake to stream data directly into the cloud warehouse. But this distinction has been blurred with the era of cloud data warehouses.
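As a minimal sketch of the streaming side, assuming the kafka-python client, a local broker, and a hypothetical topic that a downstream warehouse loader (such as Snowpipe) would consume:

```python
# Produce JSON events to a Kafka topic for streaming ingestion.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("warehouse-events", {"order_id": 42, "amount": 19.99})
producer.flush()  # make sure the record leaves the client buffer
```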
Today's data landscape is characterized by exponentially increasing volumes of data, comprising a variety of structured, unstructured, and semi-structured data types originating from an expanding number of disparate data sources located on-premises, in the cloud, and at the edge.
Spark SQL brings native support for SQL to Spark and streamlines the process of querying semi-structured and structured data. When transformations are applied to RDDs, Spark records the metadata to build up a DAG, which reflects the sequence of computations performed during the execution of the Spark job.
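A minimal sketch of that SQL-over-semi-structured-data workflow, assuming a local SparkSession and a hypothetical events.json file:

```python
# Query semi-structured JSON with plain SQL via Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

# Spark infers a schema from the JSON records, then exposes the data
# to SQL through a temporary view.
events = spark.read.json("events.json")
events.createOrReplaceTempView("events")

spark.sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id").show()
```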
Find sources of relevant data. Choose data collection methods and tools. Decide on a sufficient data amount. Set up data storage technology. Below, we'll elaborate on each step one by one and share our experience of data collection. Key differences between structured, semi-structured, and unstructured data.
In this article, we'll take a closer look at the data storage level of the data stack to determine when to invest in storage and compute tooling, what “build versus buy” really means when it comes to storage and compute, and how our five considerations might impact your decision. Let's jump in!
StructType is a collection of StructField objects that determine column name, column data type, field nullability, and metadata. PySpark imports the StructType class from pyspark.sql.types to describe the DataFrame's structure. Spark saves data in memory (RAM), making data retrieval faster when needed.
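A minimal sketch of defining such a schema explicitly; the session and column names are illustrative:

```python
# Explicit DataFrame schema built from StructField objects.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("schema-sketch").getOrCreate()

schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),  # may hold nulls
])

df = spark.createDataFrame([("Ada", 36), ("Grace", None)], schema=schema)
df.printSchema()
```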
Different databases have different patterns of data storage. For instance, MongoDB stores data in a semi-structured pattern, Cassandra stores data in the form of columns, and Redis stores data as key-value pairs. Avro creates binary data that can be both compressed and split.
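To illustrate Avro's compact, splittable binary encoding, here is a hedged sketch using the fastavro library; the library choice and record schema are assumptions:

```python
# Write Avro records as compressed binary blocks, then read them back.
import io
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "email", "type": "string"},
    ],
})

buf = io.BytesIO()
# Each compressed block boundary is a potential split point for
# parallel readers, which is what makes Avro files splittable.
writer(buf, schema, [{"id": 1, "email": "a@example.com"}], codec="deflate")

buf.seek(0)
print(list(reader(buf)))
```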
In this post, we will help you quickly level up your overall knowledge of data pipeline architecture by reviewing what data pipeline architecture is and why it is important. This is frequently referred to as a five- or seven-layer data stack (depending on who you ask).
Data modeling involves creating a conceptual representation of data objects and their relationships to each other, as well as the rules governing those relationships. To design an effective data governance program, it’s crucial to choose an operational model that fits your business size and structure.
Hadoop vs. RDBMS: on data types, Hadoop processes semi-structured and unstructured data, while an RDBMS processes structured data; on schema, Hadoop uses schema-on-read, while an RDBMS uses schema-on-write; Hadoop best fits data discovery and massive storage/processing of unstructured data.
The service provider's data center hosts the underlying infrastructure, software, and app data. Azure Redis Cache is an in-memory data storage, or cache, system based on Redis that boosts the flexibility and efficiency of applications that rely significantly on backend data stores. Define table storage in Azure.
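A minimal sketch of talking to such a cache from Python with redis-py; the host name and access key are placeholders:

```python
# Cache a value in Azure Cache for Redis with a short TTL.
import redis

cache = redis.Redis(
    host="myapp.redis.cache.windows.net",  # placeholder cache host
    port=6380,
    password="<access-key>",               # placeholder credential
    ssl=True,                              # Azure requires TLS on port 6380
)

cache.set("session:42", "cached-profile-json", ex=300)  # expire in 5 minutes
print(cache.get("session:42"))
```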
We will also need to store this data in Elasticsearch. By using Rockset, we may have to tokenize our search fields on ingestion; however, we make up for it with the simplicity of processing this data on ingestion, as well as easier querying, joining, and aggregation.
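For illustration, a hedged sketch of indexing and fetching a document with the official Elasticsearch Python client; the cluster URL, index name, and document are placeholders:

```python
# Index a document in Elasticsearch, then fetch it back by id.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

es.index(
    index="search-items",
    id="1",
    document={"title": "red running shoes", "price": 59.0},
)
print(es.get(index="search-items", id="1")["_source"])
```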
Demands on the cloud data warehouse are also evolving to require it to become more of an all-in-one platform for an organization's analytics needs. Enter Snowflake: the Snowflake Data Cloud is one of the most popular and powerful CDW providers. This noticeably saves time on copying and drastically reduces data storage costs.