Summary Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. Can you describe what constitutes a NoSQL database? If you were to start from scratch today, what database would you build?
Modern IT environments require comprehensive data for successful AIOps; that includes incorporating data from legacy systems like IBM i and IBM Z into ITOps platforms. AIOps presents enormous promise, but many organizations face hurdles in its implementation: complex ecosystems made up of multiple, fragmented systems that lack interoperability.
If you had a continuous deployment system up and running around 2010, you were ahead of the pack, but today it’s considered strange if your team does not have this for things like web applications. We dabbled in network engineering, database management, system administration, and hand-rolled C code.
Summary Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer, the process becomes more challenging. As you have gone through successive migration projects, how has that influenced the ways that you think about architecting data systems?
Think your customers will pay more for data visualizations in your application? Five years ago they may have. But today, dashboards and visualizations have become table stakes. Discover which features will differentiate your application and maximize the ROI of your embedded analytics. Brought to you by Logi Analytics.
Traditionally, answering this question would require expensive GIS (Geographic Information Systems) software or complex database setups. Today, DuckDB offers a simpler, more accessible approach for data engineers to tackle spatial problems without specialized infrastructure.
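As an illustration of that simpler approach, here is a minimal sketch using DuckDB's Python API and its spatial extension; the coordinates are hypothetical and the distance is in degrees because the points are geographic:

```python
import duckdb

con = duckdb.connect()
con.install_extension("spatial")  # fetched on demand, then cached locally
con.load_extension("spatial")

# Straight-line distance between two (lng, lat) points.
row = con.execute("""
    SELECT ST_Distance(
        ST_Point(-73.9857, 40.7484),
        ST_Point(-73.9772, 40.7527)
    ) AS dist
""").fetchone()
print(row[0])
```

No servers, extensions beyond the one install call, or GIS licenses are involved; the same queries also run against Parquet or GeoJSON files on disk.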
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. These systems are built on open standards and offer immense analytical and transactional processing flexibility. These open formats are transforming how organizations manage large datasets.
Summary A significant portion of data workflows involve storing and processing information in database engines. Your host is Tobias Macey and today I'm welcoming back Gleb Mezhanskiy to talk about how to reconcile data in database environments. Interview: How did you get involved in the area of data management?
Summary Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems, there are a number of common components that are shared across implementations. This was the core of your recent rewrite of the InfluxDB engine.
Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast-changing data.
These are all big questions about the accessibility, quality, and governance of data being used by AI solutions today. The simple idea was: how can we get more value from the transactional data in our operational systems spanning finance, sales, customer relationship management, and other siloed functions?
Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.
It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta's systems. It enhances the traceability of data flows within systems, ultimately empowering developers to swiftly implement privacy controls and create innovative products.
Summary The purpose of business intelligence systems is to allow anyone in the business to access and decode data to help them make informed decisions.
The startup was able to start operations thanks to getting access to an EU grant called NGI Search grant. The current database includes 2,000 server types in 130 regions and 340 zones. Results are stored in git and their database, together with benchmarking metadata. Each benchmarking task is evaluated sequentially.
From Sella’s status page: “Following the installation of an update to the operating system and related firmware, which led to an unstable situation.” The changes messed up all major databases in some unexpected way. Still, I’m puzzled by how long the system has been down.
Summary The majority of blog posts and presentations about data engineering and analytics assume that the consumers of those efforts are internal business users accessing an environment controlled by the business. The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it.
In 2020, anticipating the growing needs of the business and to simplify our storage offerings, we decided to consolidate our different key-value systems in the company into a single unified service called KVStore. Additionally, the last section explains how this new database supports a key platform in the product.
Introduction HDFS (Hadoop Distributed File System) is not a traditional database but a distributed file system designed to store and process big data. It provides high-throughput access to data and is optimized for […]
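To make the file-system (rather than database) nature of HDFS concrete, here is a minimal sketch driving the standard `hdfs dfs` CLI from Python; the paths and file name are hypothetical and a configured Hadoop client is assumed:

```python
import subprocess

def hdfs(*args: str) -> None:
    """Run an `hdfs dfs` subcommand and fail loudly on error."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/data/raw")           # create a directory tree
hdfs("-put", "events.json", "/data/raw/")   # copy a local file into HDFS
hdfs("-ls", "/data/raw")                    # list what landed
```

The interface is files and directories, not tables and queries, which is why HDFS is typically paired with an engine like Hive or Spark for analysis.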
Today's organizations have access to more data than ever before, and are consequently faced with the challenge of transforming this tremendous stream of real-time information into actionable insights. Encryption, access controls, and regulatory compliance (HIPAA, GDPR, etc.) are essential when handling sensitive information such as patient records or geolocation data.
A consolidated data system to accommodate a big(ger) WHOOP. When a company experiences exponential growth over a short period, it’s easy for its data foundation to feel a bit like it was built on the fly. This blog post is the second in a three-part series on migrations.
In the early ’90s, DOS programs like the ones my company made had their own text-UI screen rendering system. This rendering system was easy for me to understand, even on day one. Our rendering system was very memory inefficient, but that could be fixed. By doing so, I got to see every screen of the system.
Unify transactional and analytical workloads in Snowflake for greater simplicity Many businesses must maintain two separate databases: one to handle transactional workloads and another for analytical workloads. Sensitive data can have enormous value but is oftentimes locked down due to privacy requirements.
Change Data Capture (CDC) is a crucial technology that enables organizations to efficiently track and capture changes in their databases. In this blog post, we’ll explore what CDC is, why it’s important, and our journey of implementing Generic CDC solutions for all online databases at Pinterest. What is Change Data Capture?
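Pinterest's generic CDC implementation isn't shown in this excerpt. To make the concept concrete, here is a deliberately naive timestamp-polling sketch; production CDC systems typically read the database's write-ahead log instead (e.g., via Debezium), since polling misses deletes and intermediate states. The table and columns are hypothetical:

```python
import time
import psycopg2  # assumes a PostgreSQL source

conn = psycopg2.connect("dbname=app")  # hypothetical DSN
conn.autocommit = True
last_seen = "1970-01-01"

while True:
    with conn.cursor() as cur:
        # Pick up rows modified since the last checkpoint.
        cur.execute(
            "SELECT id, updated_at FROM orders "
            "WHERE updated_at > %s ORDER BY updated_at",
            (last_seen,),
        )
        for row_id, updated_at in cur.fetchall():
            print("changed row:", row_id)
            last_seen = str(updated_at)
    time.sleep(5)
```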
Summary Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack You shouldn't have to throw away the database to build with fast-changing data.
Were sharing details about Glean , Metas open source system for collecting, deriving and working with facts about source code. In this blog post well talk about why a system like Glean is important, explain the rationale for Gleans design, and run through some of the ways were using Glean to supercharge our developer tooling at Meta.
ERP and CRM systems are designed and built to fulfil a broad range of business processes and functions. Your first step might be to locate the orders. Then you begin researching database objects and find a couple of views, but there are some inconsistencies between them, so you do not know which one to use. Does it sound familiar?
A quick summary of these technologies: Prometheus, a time-series database and a very popular open-source solution for systems and services monitoring, which evaluates rules and can trigger alerts; and a fast, open-source, column-oriented database management system that is a popular choice for log management.
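As a small illustration of treating Prometheus as a queryable time-series database, here is a sketch against its standard HTTP query API; the localhost:9090 address assumes a default local install, and `up` is a built-in metric reporting which scrape targets are reachable:

```python
import requests

# Instant query against Prometheus's HTTP API.
resp = requests.get(
    "http://localhost:9090/api/v1/query",  # default local Prometheus address
    params={"query": "up"},
    timeout=10,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("instance"), "=>", series["value"][1])
```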
ThoughtSpot prioritizes the high availability and minimal downtime of our systems to ensure a seamless user experience. In the realm of modern analytics platforms, where rapid and efficient processing of large datasets is essential, swift metadata access and management are critical for optimal system performance. What is Atlas?
This involves getting data from an API and storing it in a PostgreSQL database. In the second phase, we’ll develop an application that uses a language model to interact with this database. The second article, which will come later, will delve into creating agents using tools like LangChain to communicate with external databases.
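A minimal sketch of that first phase, fetching from an API and upserting into PostgreSQL; the URL, table, and schema here are hypothetical placeholders:

```python
import requests
import psycopg2

# Fetch records from the API; response shape is assumed to be a JSON list.
records = requests.get("https://api.example.com/items", timeout=30).json()

# `with conn` commits the transaction on success and rolls back on error.
conn = psycopg2.connect("dbname=app")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS items (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        )
    """)
    for rec in records:
        cur.execute(
            "INSERT INTO items (id, name) VALUES (%s, %s) "
            "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name",
            (rec["id"], rec["name"]),
        )
```

The upsert (`ON CONFLICT ... DO UPDATE`) makes the load idempotent, so re-running the pipeline does not duplicate rows.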
Optimize performance and cost with a broader range of model options. Cortex AI provides easy access to industry-leading models via LLM functions or REST APIs, enabling you to focus on driving generative AI innovations. We offer a broad selection of models in various sizes, context window lengths and levels of language support.
Our hope is that making salary ranges more accessible on Comprehensive.io… For AI, we’ve built a system to efficiently use GPT-4 for this purpose, including auto-crafting prompts and performing pre- and post-processing, “on the backend, and Postgres for database storage.” How does Comprehensive.io…
Astasia Myers: The three components of the unstructured data stack. LLMs and vector databases significantly improved the ability to process and understand unstructured data. I never thought of a PDF as a self-contained document database, but that seems a reality that we can’t deny. What are you waiting for?
Furthermore, most vendors require valuable time and resources for cluster spin-up and spin-down, disruptive upgrades, code refactoring or even migrations to new editions to access features such as serverless capabilities and performance improvements.
For transactional databases, it’s mostly Microsoft SQL Server, but also other databases like PostgreSQL, ScyllaDB and Couchbase. At peak load, Agoda sees around 7.5M queries per second as total load, spread across its managed database-as-a-service (DBaaS). It uses Spark for the data platform.
Summary Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. Can you describe what RisingWave is and the story behind it?
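The excerpt doesn't include code, but since RisingWave speaks the PostgreSQL wire protocol, a stream-processing job can be expressed as a materialized view over a changing table. A minimal sketch, assuming RisingWave's documented local defaults (port 4566, user root, database dev) and a hypothetical orders table:

```python
import psycopg2

# Connection defaults below are RisingWave's documented local setup;
# the `orders` table and view name are hypothetical.
conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True
with conn.cursor() as cur:
    # The view is maintained incrementally as new rows stream in,
    # rather than recomputed on every query.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS orders_per_user AS
        SELECT user_id, COUNT(*) AS order_count
        FROM orders
        GROUP BY user_id
    """)
```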
This architecture is valuable for organizations dealing with large volumes of diverse data sources, where maintaining accuracy and accessibility at every stage is a priority. This foundational layer is a repository for various data types, from transaction logs and sensor data to social media feeds and system logs.
Different roles and tasks in the business need their own ways to access and analyze the data in the organization (dbt, BI, warehouse marts, etc.). In order to enable this use case while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem.
CDC Evaluation Guide Google Sheet: [link] CDC Evaluation Guide GitHub: [link] Change Data Capture (CDC) is a powerful technology in data engineering that allows for continuously capturing changes (inserts, updates, and deletes) made to source systems. However, managing data consistency across microservices can be challenging.
TL;DR Take advantage of old school database tricks, like ENUM data types and column constraints. Some positives (Microsoft Access comes to mind), but some are questionable at best, such as traditional data design principles and data quality and validation at ingestion. Let's get to it! Generate data lineage with one small Python script.
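As an illustration of those old school tricks, here is a hedged sketch in PostgreSQL syntax; the type, table, and column names are hypothetical:

```python
import psycopg2

# PostgreSQL DDL showing both tricks: an ENUM type constrains a column to a
# fixed set of values, and CHECK constraints validate rows at ingestion.
DDL = """
CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

CREATE TABLE orders (
    id       BIGINT PRIMARY KEY,
    status   order_status NOT NULL,
    quantity INTEGER NOT NULL CHECK (quantity > 0),
    email    TEXT CHECK (email LIKE '%@%')
);
"""

with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:  # hypothetical DSN
    cur.execute(DDL)
```

Bad rows are rejected at write time, which is cheaper than cleaning them downstream.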
The answer lies in unstructured data processing—a field that powers modern artificial intelligence (AI) systems. To address these challenges, AI Data Engineers have emerged as key players, designing scalable data workflows that fuel the next generation of AI systems. Experience with vector databases (e.g.,
A nonprofit educational healthcare organization is faced with the challenge of modernizing its critical systems while ensuring uninterrupted access to essential services. However, while the SIS migration was a significant step forward, the institution’s on-premise SQL Server systems remained vital.
Ensure the provider supports the infrastructure necessary for your data needs, such as managed databases, storage, and data pipeline services. Leverage Built-In Partitioning Features: Use built-in features provided by databases like Snowflake or Databricks to automatically partition large datasets.
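As a hedged sketch of what a built-in partitioning feature looks like, here is a Databricks/Spark SQL example; the table, columns, and Delta format are assumptions, and the rough Snowflake analogue would be a clustering key (ALTER TABLE ... CLUSTER BY) on top of its automatic micro-partitioning:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Declarative partitioning: Spark lays files out by event_date so queries
# that filter on it prune irrelevant partitions automatically.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        user_id    BIGINT,
        payload    STRING,
        event_date DATE
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")
```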
To address this shortcoming, Datorios created an observability platform for Flink that brings visibility to the internals of this popular stream processing system. How have the requirements of generative AI shifted the demand for streaming data systems? What role does Flink play in the architecture of generative AI systems?