Introduction: HDFS (Hadoop Distributed File System) is not a traditional database but a distributed file system designed to store and process big data. It provides high-throughput access to data and is optimized for […] The post A Dive into the Basics of Big Data Storage with HDFS appeared first on Analytics Vidhya.
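Programmatic access to HDFS from Python often goes through WebHDFS. Below is a minimal sketch using the `hdfs` (hdfscli) package; the NameNode URL, user, and paths are hypothetical, and it assumes WebHDFS is enabled on the cluster.

```python
# Minimal sketch: writing and reading a file on HDFS from Python.
# Assumes the `hdfs` package (hdfscli) and a NameNode with WebHDFS
# enabled; host, port, user, and paths below are hypothetical.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small text file to HDFS.
client.write("/tmp/example.txt", data=b"hello from hdfs", overwrite=True)

# Read it back; client.read() is used as a context manager.
with client.read("/tmp/example.txt") as reader:
    print(reader.read())
```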
The complexity of information storage technologies increases exponentially with the growth of data. From physical hard drives to cloud computing, unravel the captivating world of data storage and recognize its ever-evolving role in our […] The post What is Data Storage and How is it Used?
From Oracle to NoSQL databases and beyond, read about data management solutions from the early days of the RDBMS to those supporting AI applications.
Whether it was moving data from a local database instance to S3 or some other data storage layer. As… The post What Is AWS DMS And Why You Shouldn’t Use It As An ELT appeared first on Seattle Data Guy. It was interesting to see AWS DMS used in this manner.
Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.
What is Cloudera Operational Database (COD)? Operational Database is a relational and non-relational database built on Apache HBase, designed to support OLTP applications that use big data. The operational database in Cloudera Data Platform has the following components: […]
Summary: The Cassandra database is one of the first open source options for globally scalable storage systems. The community recently released a new major version that marks a milestone in its maturity and stability as a project and database. Since its introduction in 2008 it has been powering systems at every scale.
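For a sense of what working with Cassandra looks like from Python, here is a minimal sketch using the DataStax `cassandra-driver`; the contact point, keyspace, and table names are hypothetical.

```python
# Minimal sketch of connecting to Cassandra with the DataStax Python
# driver (`cassandra-driver`); contact points and schema are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)
""")
session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "ada"))

for row in session.execute("SELECT id, name FROM demo.users"):
    print(row.id, row.name)

cluster.shutdown()
```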
Python is used extensively among Data Engineers and Data Scientists to solve all sorts of problems, from ETL/ELT pipelines to building machine learning models. Apache HBase is an effective data storage system for many workflows, but accessing that data from Python can be a struggle.
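One common route is the `happybase` library, which talks to HBase's Thrift server. A minimal sketch, assuming a running Thrift service; the host, table, and column family names are hypothetical.

```python
# Minimal sketch of accessing HBase from Python with `happybase`,
# which connects to the HBase Thrift server; host, table, and
# column family names are hypothetical.
import happybase

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("metrics")

# Store a row: HBase keys and values are raw bytes.
table.put(b"row-1", {b"cf:value": b"42"})

# Fetch it back.
row = table.row(b"row-1")
print(row[b"cf:value"])

connection.close()
```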
Goku is our in-house time series database providing cost-efficient and low-latency storage for metrics data. Goku Long Term Storage Architecture: Summary and Challenges. Figure 9: Flow of data from GokuS to GokuL. In short, RocksDB is a key-value store that uses a log-structured DB engine for storage and retrieval.
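To illustrate the key-value model RocksDB exposes, here is a minimal sketch using the `python-rocksdb` bindings; the database path and keys are hypothetical, and the exact Python package can vary by platform.

```python
# Minimal sketch of the key-value model described above, using the
# `python-rocksdb` bindings; the path and keys are hypothetical.
import rocksdb

opts = rocksdb.Options(create_if_missing=True)
db = rocksdb.DB("metrics.db", opts)

# RocksDB stores raw bytes; writes go through a log-structured engine.
db.put(b"metric:cpu:t1", b"0.73")
print(db.get(b"metric:cpu:t1"))
```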
Due to its lack of POSIX conformance, some believe it to be data storage instead. Introduction: The Hadoop Distributed File System (HDFS) is a Java-based file system that is Distributed, Scalable, and Portable. HDFS and […] The post Top 10 Hadoop Interview Questions You Must Know appeared first on Analytics Vidhya.
Introduction: Apache Flume is a tool/service/data ingestion mechanism for gathering, aggregating, and delivering huge amounts of streaming data from diverse sources, such as log files and events, to centralized data storage. Flume is highly dependable, distributed, and customizable.
The CDP Operational Database (COD) builds on the foundation of existing operational database capabilities that were available with Apache HBase and/or Apache Phoenix in legacy CDH and HDP deployments. […] (e.g. Cloudera Machine Learning or Cloudera Data Warehouse), to deliver fast data and analytics to downstream components.
These use cases are typically the first and easiest behavior shift for data teams once they enter the cloud. They are: moving from ETL to ELT to accelerate time-to-insight. You can't just load anything into your on-premises database, especially not if you want a query to return before you hit the weekend.
Introduction: Data volumes are growing rapidly, with vast numbers of data points produced every second. Businesses are now looking to different types of data storage to store and manage their data effectively.
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. Though basic and easy to use, traditional table storage formats struggle to keep up; modern formats instead track data files within the table along with their column statistics.
Agoda co-locates in all its data centers, leasing space for its racks; the largest data center consumes about 1 MW of power. It uses Spark for the data platform. For transactional databases it mostly uses Microsoft SQL Server, along with other databases like PostgreSQL, ScyllaDB, and Couchbase.
The foundational skills of traditional data engineers and AI data engineers are similar, with AI data engineers more heavily focused on machine learning data infrastructure, AI-specific tools, vector databases, and LLM pipelines. Let's dive into the tools necessary to become an AI data engineer.
A streaming ETL for Snowflake approach loads data to Snowflake from diverse sources such as transactional databases, security system logs, and IoT sensors/devices in real time, while simultaneously meeting scalability, latency, security, and reliability requirements.
Think of a database as a smart, organized library that stores and manages information efficiently. Data structures, on the other hand, are the tools that help organize and arrange data within a computer program. What is a Database? SQL, or Structured Query Language, is widely used for writing and querying data.
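To make the library analogy concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table and data are hypothetical.

```python
# Minimal sketch of the "organized library" idea: defining, filling,
# and querying a table with SQL via Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO books (title, year) VALUES (?, ?)",
    [("SICP", 1985), ("The C Programming Language", 1978)],
)

# SQL describes *what* we want; the database decides how to find it.
for title, year in conn.execute("SELECT title, year FROM books WHERE year < 1980"):
    print(title, year)

conn.close()
```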
This allows users to interact with their data without interruption, regardless of system scale. This article highlights the performance optimizations implemented to initialize Atlas, our in-house graph database, in less than two minutes. What is metadata? What is Atlas?
This switch has been led by the modern data stack vision. In terms of paradigms, before 2012 we were doing ETL because storage was expensive, so it was a requirement to transform data before storage (mainly a data warehouse) in order to have the most optimized data for querying.
And so we are thrilled to introduce our latest applied ML prototype (AMP): a large language model (LLM) chatbot customized with website data using Meta's Llama2 LLM and Pinecone's vector database. Figure: high-level overview of real-time data ingest with Cloudera DataFlow to the Pinecone vector database.
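As a rough sketch of the vector-database side of such a chatbot, here is what upserting and querying embeddings can look like with the Pinecone Python client; the API key, index name, and tiny 4-dimensional vectors are hypothetical stand-ins (real LLM embeddings have hundreds of dimensions), and field access may vary slightly by client version.

```python
# Minimal sketch of storing and querying embeddings in a Pinecone index;
# key, index name, and vectors are hypothetical stand-ins.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("website-chunks")

# Store embedded website chunks alongside their source text.
index.upsert(vectors=[
    ("chunk-1", [0.1, 0.2, 0.3, 0.4], {"text": "About our product..."}),
    ("chunk-2", [0.9, 0.1, 0.0, 0.2], {"text": "Pricing details..."}),
])

# Retrieve the chunks most similar to a (hypothetical) query embedding.
results = index.query(vector=[0.1, 0.2, 0.3, 0.5], top_k=2, include_metadata=True)
for match in results.matches:
    print(match.id, match.score)
```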
Kovid wrote an article that tries to explain the ingredients of a data warehouse, and he does it well. A data warehouse is a piece of technology that acts on three ideas: the data modeling, the data storage, and the processing engine. Slowly, year after year, graph databases' time is coming.
Each of these technologies has its own strengths and weaknesses, but all of them can be used to gain insights from large data sets. As organizations continue to generate more and more data, big data technologies will become increasingly essential. Let's explore the technologies available for big data.
Do you want a database system that can scale quickly and manage heavy workloads? Should that be the case, Azure SQL Database might be your best bet. Microsoft SQL Server's functionalities are fully included in Azure SQL Database, a cloud-based database service that also offers greater flexibility and scalability.
Back when I studied Computer Science in the early 2000s, databases like MS Access and Oracle ruled. Now, it's different. The rise of big data and NoSQL changed the game. Systems evolved from simple to complex, and we had to split how we find data from where we store it. What Is a Database? Let's begin!
As new data comes in on day 2, we may have additional columns like PHONE and EMAIL. Handling Parquet Data with Schema Evolution: let's now look at how schema evolution works with Parquet files. Parquet is a columnar storage format, often used for its efficient data storage and retrieval.
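A minimal sketch of that day-1/day-2 scenario with pyarrow; file names are hypothetical, and newer pyarrow versions replace `promote=True` with `promote_options="default"`.

```python
# Minimal sketch of Parquet schema evolution with pyarrow: day 2 adds
# PHONE and EMAIL columns, and reading unifies the schemas so the
# missing day-1 columns become nulls. File names are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

day1 = pa.table({"ID": [1, 2], "NAME": ["ada", "bob"]})
day2 = pa.table({
    "ID": [3], "NAME": ["eve"],
    "PHONE": ["555-0100"], "EMAIL": ["eve@example.com"],
})

pq.write_table(day1, "day1.parquet")
pq.write_table(day2, "day2.parquet")

# Unify the two schemas; on newer pyarrow use promote_options="default".
tables = [pq.read_table(f) for f in ("day1.parquet", "day2.parquet")]
merged = pa.concat_tables(tables, promote=True)
print(merged.to_pandas())
```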
Ensure the provider supports the infrastructure necessary for your data needs, such as managed databases, storage, and data pipeline services. Utilize Cloud-Native Tools: Leverage cloud-native data pipeline tools like Ascend to build and orchestrate scalable workflows.
In between the Hadoop era, the modern data stack, and the machine learning revolution everyone (but me) waits for. But, funnily, in the end we are still copying data from database to database using CSVs, like 40 years ago. Don't be surprised if no one uses data catalogs.
Data Analyst: data analysts' job is to create and maintain data systems and databases, using statistical tools to interpret data. Roles and responsibilities: develop, design, and create data models; report data findings to management; monitor data collection.
MotherDuck is the company providing DuckDB as a cloud product, though they are not developing DuckDB itself. Their product is quite young but works as expected: with a simple connection string you get an analytical cloud database that just works, and that can be instantly replaced by a local one if needed (see the sketch below). It also provides more control over data storage.
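A minimal sketch of that swap with the DuckDB Python API; the database names are hypothetical, and the MotherDuck path assumes a valid token.

```python
# Minimal sketch: the same DuckDB API works against MotherDuck (via an
# "md:" connection string) or a local file; names here are hypothetical.
import duckdb

# Cloud: con = duckdb.connect("md:my_db")  # requires a MotherDuck token
# Local drop-in replacement:
con = duckdb.connect("local.duckdb")

con.sql("CREATE TABLE IF NOT EXISTS events AS SELECT 1 AS id, 'click' AS kind")
print(con.sql("SELECT kind, count(*) FROM events GROUP BY kind").fetchall())
```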
The ingestion layer supports multiple data types and formats, including: Batch Data: data collected and processed in discrete chunks, typically from static sources such as databases or logs. Data storage: data storage follows ingestion. How will we connect to the data sources?
Summary: Most databases are designed to work with textual data, with some special-purpose engines that support domain-specific formats. TileDB is a data engine that was built to support every type of data by using multi-dimensional arrays as the foundational primitive. How is the built-in data versioning implemented?
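A minimal sketch of the multi-dimensional array primitive with the TileDB-Py API; the array URI is hypothetical.

```python
# Minimal sketch of TileDB's multi-dimensional array primitive: store a
# dense 2-D NumPy array and slice it back; the URI is hypothetical.
import numpy as np
import tiledb

data = np.arange(12, dtype=np.int32).reshape(3, 4)
tiledb.from_numpy("my_dense_array", data)  # creates a dense TileDB array on disk

with tiledb.open("my_dense_array") as arr:
    print(arr[0:2, 1:3])  # slice along both dimensions
```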
AWS provides more than 200 fully featured services, including storage, database, and computing. Some prominent cloud services offered by Alibaba Cloud include database storage, large-scale computing, network visualization, elastic computing, big data analytics, and management services.
Both companies have added Data and AI to their slogans; Snowflake used to be The Data Cloud and now they're The AI Data Cloud. Native CDC for Postgres and MySQL: Snowflake will be able to connect to Postgres and MySQL to natively move data from your databases to the warehouse.
The ultimate SQL guide: after the last canva on data interviews, here's a canva to learn SQL, from an introduction to databases to SQL writing. It covers simple SELECTs and advanced concepts. This is neat. Malloy's Near Term Roadmap: I recently shared the Malloy demo, which was awesome (… but I missed it).
If you're a data engineering podcast listener, you get credits worth $3000 on an annual subscription. TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time-stamped so you can measure how a system is changing.
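A minimal sketch of time-series storage in TimescaleDB from Python via psycopg2; the connection string and schema are hypothetical, and `create_hypertable` assumes the TimescaleDB extension is installed.

```python
# Minimal sketch: a regular Postgres table is converted into a
# TimescaleDB hypertable partitioned on its timestamp column.
# Connection details and schema are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres host=localhost")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS conditions (
        time        TIMESTAMPTZ NOT NULL,
        device      TEXT,
        temperature DOUBLE PRECISION
    )
""")
# create_hypertable is TimescaleDB's function for time partitioning.
cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE)")

cur.execute("INSERT INTO conditions VALUES (now(), %s, %s)", ("sensor-1", 21.5))
conn.commit()
cur.close()
conn.close()
```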
This requires a new class of data storage which can accommodate that demand without having to rearchitect your system at each level of growth. YugabyteDB is an open source database designed to support planet-scale workloads with high data density and full ACID compliance. A growing trend in database engines (e.g.
Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security.
Here are some things that you should learn: recursion, bubble sort, selection sort, binary search, insertion sort, databases, and caching. To build a high-performance system, programmers need to rely on the cache; in addition, it is required in a database to keep track of users' responses and to manage the DBMS. Two of these basics are sketched below.
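Two of those basics in plain Python: binary search over a sorted list, and caching so repeated lookups skip recomputation (the cached function is a hypothetical stand-in for a database query).

```python
# Binary search over a sorted list, plus result caching with lru_cache.
from functools import lru_cache

def binary_search(items, target):
    """Return the index of target in sorted `items`, or -1 if absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

@lru_cache(maxsize=128)
def expensive_lookup(user_id):
    # Stand-in for a slow database query; results are memoized.
    return f"profile-{user_id}"

print(binary_search([2, 5, 8, 13, 21], 13))  # -> 3
print(expensive_lookup(42))  # computed once, then served from cache
```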
Goku is our in-house time series database that provides cost-efficient and low-latency storage for metrics data. The Observability team has a tool that uses this data, applies some heuristics on top (not in the scope of this document), and determines metrics which should be blocked.
Nowadays, the term is used for petabytes or even exabytes of data (an exabyte is 1024 petabytes), close to trillions of records from billions of people. In this fast-moving landscape, the key to making a difference is picking the correct data storage solution for your business. […]