Does the LLM capture all the relevant data and context required for it to deliver useful insights? Are we allowed to use all the data, or are there copyright or privacy concerns? (Not to mention the crazy stories about Gen AI making up answers without the data to back them up!) But simply moving the data wasn't enough.
dbt Core is an open-source framework that helps you organise data warehouse SQL transformations. dbt was born out of the observation that more and more companies were switching from on-premise Hadoop data infrastructure to cloud data warehouses, a switch led by the modern data stack vision.
We're sharing how Meta built support for data logs, which provide people with additional data about how they use our products. Here we explore the initial system designs we considered, an overview of the current architecture, and some important principles Meta takes into account in making data accessible and easy to understand.
Batch data processing — historically known as ETL — is extremely challenging. It's time-consuming, brittle, and often unrewarding. Not only that, it's hard to operate, evolve, and troubleshoot. In this post, we'll explore how applying the functional programming paradigm to data engineering can bring a lot of clarity to the process.
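As a quick illustration of that functional framing, here is a minimal Python sketch; the field names and file-per-partition layout are assumptions for illustration, not details from the post. Each task is a pure function, and each load overwrites its entire target partition, so re-runs are idempotent.

```python
import json
from pathlib import Path

def transform(rows):
    """Pure step: the same input always produces the same output."""
    return [
        {"user_id": r["user_id"], "amount_usd": round(r["amount_cents"] / 100, 2)}
        for r in rows
    ]

def run_partition(source_rows, warehouse_dir, ds):
    """Idempotent load: the task owns one partition (hypothetical
    ds= layout) and overwrites it entirely, so re-running the same
    date never duplicates data."""
    out = warehouse_dir / f"ds={ds}.json"
    out.write_text(json.dumps(transform(source_rows)))

if __name__ == "__main__":
    wh = Path("warehouse")
    wh.mkdir(exist_ok=True)
    rows = [{"user_id": 1, "amount_cents": 1250}]
    run_partition(rows, wh, "2024-01-01")
    run_partition(rows, wh, "2024-01-01")  # safe to re-run: overwrite, not append
```

Because the second run overwrites rather than appends, a failed or repeated backfill leaves the warehouse in the same state as a clean run.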
What is Data Transformation? Data transformation is the process of converting raw data into a usable format to generate insights. It involves cleaning, normalizing, validating, and enriching data, ensuring that it is consistent and ready for analysis.
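To make those four steps concrete, here is a small, self-contained Python sketch; the record fields and rules are invented for illustration.

```python
from datetime import datetime

RAW = [
    {"email": " Alice@Example.COM ", "signup": "2024-01-05", "country": "us"},
    {"email": "bob@example.com", "signup": "not-a-date", "country": "DE"},
]

def transform(record):
    """Clean, normalize, validate, and enrich one raw record."""
    email = record["email"].strip().lower()   # clean: trim and lowercase
    country = record["country"].upper()       # normalize: one country format
    try:                                      # validate: reject bad dates
        signup = datetime.strptime(record["signup"], "%Y-%m-%d").date()
    except ValueError:
        return None
    return {
        "email": email,
        "country": country,
        "signup": signup.isoformat(),
        "domain": email.split("@")[1],        # enrich: derive a new field
    }

clean = [row for raw in RAW if (row := transform(raw)) is not None]
print(clean)  # only the valid, consistent record survives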
This post follows up on The Rise of the Data Engineer, a recent post that attempted to define data engineering and described how this new role relates to historical and modern roles in the data space. The data warehouse needs to reflect the business, and the business should have clarity on how it thinks about analytics.
As cloud computing platforms make it possible to perform advanced analytics on ever larger and more diverse data sets, new and innovative approaches have emerged for storing, preprocessing, and analyzing information. In this article, we'll focus on data lake vs. data warehouse.
Your data engineering pipeline started simple: a few CSV exports, some Python scripts, and manual updates every week. You’re left wondering if there’s a breaking point where your DIY data solution won’t cut it anymore—and honestly, you might be there already. Once you’ve got the data flowing in, you need somewhere to put it.
A data lake is a centralized and scalable repository that stores structured and unstructured data. The need for a data lake arises from the growing volume, variety, and velocity of data that companies need to manage and analyze.
When it comes to storing large volumes of data, a simple database will be impractical due to the processing and throughput inefficiencies that emerge when managing and accessing big data. This article looks at the options available for storing and processing big data, which is too large for conventional databases to handle.
Most of what is written, though, has to do with the enabling technology platforms (cloud, edge, or point solutions like data warehouses) or the use cases driving these benefits (predictive analytics applied to preventive maintenance, financial institutions' fraud detection, or predictive health monitoring, for example), not the underlying data.
Let’s set the scene: your company collects data, and you need to do something useful with it. Whether it’s customer transactions, IoT sensor readings, or just an endless stream of social media hot takes, you need a reliable way to get that data from point A to point B while doing something clever with it along the way.
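As a toy illustration of that point-A-to-point-B flow, here is a generator-based pipeline in Python; the event shape and the "something clever" step are assumptions for the sketch, not from the article.

```python
def extract(events):
    """Point A: yield raw events from any iterable source."""
    yield from events

def transform(stream):
    """The 'something clever' in the middle: filter and reshape in flight."""
    for e in stream:
        if e.get("type") == "purchase":
            yield {"user": e["user"], "value": e["amount"]}

def load(stream, sink):
    """Point B: append each finished record to the destination."""
    for record in stream:
        sink.append(record)

sink = []
events = [
    {"type": "purchase", "user": "a", "amount": 9.99},
    {"type": "pageview", "user": "b"},
]
load(transform(extract(events)), sink)
print(sink)  # [{'user': 'a', 'value': 9.99}]
```

Because each stage is a generator, records stream through one at a time rather than being materialized in full between steps.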
ELT is becoming the default choice for data architectures, and yet many best practices focus primarily on the "T": the transformations. But the extract and load phases are where data quality is determined for transformation and beyond. "Raw data" sounds clear. But wait, why aren't these "best practices"?
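One way to picture quality gates at the load step is the hedged Python sketch below; the required fields are invented for illustration. The idea is to validate shape and volume before anything lands, so "raw" data is at least structurally sound.

```python
def check_extract(rows, required=frozenset({"id", "ts"})):
    """Fail fast before loading: basic volume and schema checks so
    downstream transformations start from structurally sound rows."""
    if not rows:
        raise ValueError("empty extract: refusing to load zero rows")
    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:
            raise ValueError(f"row {i} missing fields: {missing}")
    return rows

# Usage: gate the staged batch before it is written to the warehouse.
staged = check_extract([{"id": 1, "ts": "2024-01-01T00:00:00Z"}])
```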
This year, the Snowflake Summit was held in San Francisco from June 2 to 5, while the Databricks Data+AI Summit took place 5 days later, from June 10 to 13, also in San Francisco. Using a quick semantic analysis, "The" means both want to be THE platform you need when you're doing data.
Data is central to modern business and society. Depending on what sort of leaky analogy you prefer, data can be the new oil, gold, or even electricity. Of course, even the biggest data sets are worthless, and might even be a liability, if they aren't organized properly.
At TCS, we help companies shift their enterprise data warehouse (EDW) platforms to the cloud as well as offering IT services. We're extremely familiar with just how tricky a cloud migration can be, especially when it involves moving historical business data. How many tables and views will be migrated, and how much raw data?
Data warehouses are the centralized repositories that store and manage data from various sources. They are integral to an organization's data strategy, ensuring data accessibility, accuracy, and utility. However, beneath their surface lies a host of invisible risks embedded within the data warehouse layers.
The terms “Data Warehouse” and “Data Lake” may have confused you, and you may have some questions. What is a Data Warehouse? Structuring data refers to converting unstructured data into tables and defining data types and relationships based on a schema.
A lot of data teams embraced dbt, or at least SQL with engineering practices, to transform data in cloud data warehouses. It is interesting to read this post jointly with the future of data engineering at Meta. Data Economy 💰 Betterdata raises $1.65m seed round. Synthetic data is AI-generated data.
Below is our fourth post (4 of 5) on combining data mesh with DataOps to foster innovation while addressing the challenges of a decentralized architecture. We’ve covered the basic ideas behind data mesh and some of the difficulties that must be managed. Below is a discussion of a data mesh implementation in the pharmaceutical space.
Modern companies are ingesting, storing, transforming, and leveraging more data to drive more decision-making than ever before. Data teams need to balance the need for robust, powerful data platforms with increasing scrutiny on costs. But, the options for data storage are evolving quickly. Let’s dive in.
A tutorial on how to use VDK to perform batch data processing. Versatile Data Kit (VDK) is an open-source data ingestion and processing framework designed to simplify data management complexities.
This generalisation makes their data models complex and cryptic, and requires domain expertise. Even harder to manage, a common setup within large organisations is to have several instances of these systems, with some underlying processes in charge of transmitting data among them, which can lead to duplication, inconsistency, and opacity.
The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools.
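A minimal sketch of what such a repository might track, assuming a toy in-memory catalog; the class and field names here are hypothetical, not any particular catalog's API.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    name: str
    owner: str
    upstream: list = field(default_factory=list)  # lineage: source datasets
    healthy: bool = True                          # health/status reporting

class MetadataRepository:
    """Toy catalog: registers datasets, records lineage, reports health."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry):
        self._entries[entry.name] = entry

    def lineage(self, name):
        """Walk upstream dependencies recursively for discoverability."""
        entry = self._entries[name]
        return {name: [self.lineage(up) for up in entry.upstream]}

repo = MetadataRepository()
repo.register(DatasetEntry("raw_orders", owner="ingest-team"))
repo.register(DatasetEntry("daily_revenue", owner="analytics",
                           upstream=["raw_orders"]))
print(repo.lineage("daily_revenue"))  # {'daily_revenue': [{'raw_orders': []}]}
```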
In order to make it easier for developers to build customer profiles in a way that respects their privacy, Serge Huber helped create the Apache Unomi framework as an open-source customer data platform.
Data pipelines are the backbone of your business’s data architecture. Implementing a robust and scalable pipeline ensures you can effectively manage, analyze, and organize your growing data. Most importantly, these pipelines enable your team to transform data into actionable insights, demonstrating tangible business value.
“Data Lake vs Data Warehouse = Load First, Think Later vs Think First, Load Later.” The terms data lake and data warehouse are frequently stumbled upon when it comes to storing large volumes of data. What is a data warehouse architecture, and what is a data lake?
A 2016 data science report from data enrichment platform CrowdFlower found that data scientists spend around 80% of their time in data preparation (collecting, cleaning, and organizing of data) before they can even begin to build machine learning (ML) models to deliver business value.
Users today are asking ever more from their data warehouse. As an example of this, in this post we look at Real-Time Data Warehousing (RTDW), a category of use cases customers are building on Cloudera that is becoming more and more common, such as ingesting hundreds of terabytes of network event data per day.
Learn about slowly changing dimensions (SCD) and how to implement SCD Type 2 in VDK. Data is the backbone of any organization, and in today's fast-paced world, it is crucial to keep track of its versions. Slowly changing dimensions store and manage current and historical data in a data warehouse.
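VDK has its own implementation; purely to illustrate the SCD Type 2 mechanics, here is a plain-Python sketch with invented column names, where each version of a row is bracketed by valid_from/valid_to dates.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # sentinel meaning "still current"

def apply_scd2(dimension, incoming, today):
    """SCD Type 2: when a key's attributes change, close out the current
    row and append a new version, preserving full history."""
    for new in incoming:
        current = next((r for r in dimension
                        if r["key"] == new["key"] and r["valid_to"] == HIGH_DATE),
                       None)
        if current and current["attrs"] == new["attrs"]:
            continue                     # nothing changed, keep current row
        if current:
            current["valid_to"] = today  # expire the old version
        dimension.append({"key": new["key"], "attrs": new["attrs"],
                          "valid_from": today, "valid_to": HIGH_DATE})
    return dimension

dim = apply_scd2([], [{"key": "cust-1", "attrs": {"city": "Oslo"}}], date(2024, 1, 1))
dim = apply_scd2(dim, [{"key": "cust-1", "attrs": {"city": "Bergen"}}], date(2024, 6, 1))
# dim now holds two rows: Oslo (closed out on 2024-06-01) and Bergen (current)
```

A query "as of" any date simply selects rows where valid_from <= date < valid_to, which is what makes Type 2 dimensions useful for historical reporting.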
The year 2024 saw some enthralling changes in volume and variety of data across businesses worldwide. The surge in data generation is only going to continue. Foresighted enterprises are the ones who will be able to leverage this data for maximum profitability through data processing and handling techniques.
How to build a modern, scalable data platform to power your analytics and data science projects (updated). Table of contents: What's Changed?, The Platform, Integration, Data Store, Transformation, Orchestration, Presentation, Transportation, Observability, Closing. Over the last three years, my life has changed as well.
Introduction: Embracing the Future with Ripple's Data Platform Migration Welcome to a pivotal moment in Ripple's data journey. As leaders at the intersection of blockchain technology and financial services, we're excited to share a transformative step in our data management evolution.
Learn from Software Engineers and Discover the Joy of ‘Worse is Better’ Thinking. Recently, I have had the fortune of speaking to a number of data engineers and data architects about the problems they face with data in their businesses. The data industry should not be afraid to think the same way.
In this Q&A, we hear from Nico Acosta, CEO and Co-Founder of Propel, about how his company is building an API platform to equip developers to build with data, and why data architecture is the most important technical decision a company will make. Unlocking the creativity of developers to build with data. APIs do just that.
The desire to save every bit and byte of data for future use, and to make data-driven decisions, is the key to staying ahead in the competitive world of business operations. For the same cost, organizations can now store 50 times as much data in a Hadoop data lake as in a data warehouse.
Greg Rahn: Sure. After having rebuilt their data warehouse, I decided to take a little bit more of a pointed role, and I joined Oracle as a database performance engineer. I spent eight years in the real-world performance group, where I specialized in high-visibility and high-impact data warehousing competes and benchmarks.
Data Analytics is an extremely important field in today's business world, and it will only become more so as time goes on. By 2023, Data Analytics is projected to be worth USD 240.56. Data Analyst interviews are competitive and the questions are difficult. Why is MS Access important in Data Analytics?
Data Science has risen to become one of the world's foremost emerging multidisciplinary fields in technology. Recruiters are hunting for people with data science knowledge and skills these days. Data scientists collect, analyze, and interpret large amounts of data.
Data pipelines are messy. Data engineering design patterns are repeatable solutions that help you structure, optimize, and scale data processing, storage, and movement. They make data workflows more resilient and easier to manage when things inevitably go sideways. Data lake or warehouse? Let's take a look.
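As one example of such a pattern, here is a hedged Python sketch of retry-with-backoff, a common way to keep transient failures from sinking a whole pipeline run; the helper name and parameters are illustrative, not from the article.

```python
import random
import time

def with_retries(op, attempts=3, base_delay=1.0):
    """Retry pattern: re-run a flaky step with exponential backoff plus
    jitter, re-raising only after the final attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())

# Usage: wrap any side-effecting pipeline step, e.g. a warehouse load.
result = with_retries(lambda: "loaded 1,024 rows")
print(result)
```

Note that retrying only goes sideways safely if the wrapped step is idempotent, which is why retry and idempotency patterns usually travel together.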
Dimensional modeling is one of many data modeling techniques that are used by data practitioners to organize and present data for analytics. Other data modeling techniques include Data Vault (DV), Third Normal Form (3NF), and One Big Table (OBT) to name a few.
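To show the shape of a dimensional model, here is a tiny star-schema sketch in plain Python dictionaries; the tables and columns are invented for illustration. Facts carry foreign keys and measures; dimensions carry descriptive attributes.

```python
# Dimension table: one row per customer, holding descriptive attributes.
dim_customer = {101: {"name": "Acme", "segment": "Enterprise"}}

# Fact table: one row per order, holding foreign keys plus measures.
fact_orders = [
    {"customer_key": 101, "date": "2024-03-01", "revenue": 500.0},
    {"customer_key": 101, "date": "2024-03-02", "revenue": 250.0},
]

# An analytic query joins facts out to a dimension, then aggregates
# a measure over a dimension attribute.
revenue_by_segment = {}
for row in fact_orders:
    segment = dim_customer[row["customer_key"]]["segment"]
    revenue_by_segment[segment] = revenue_by_segment.get(segment, 0) + row["revenue"]

print(revenue_by_segment)  # {'Enterprise': 750.0}
```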
After the hustle and bustle of extracting data from multiple sources, you have finally loaded all your data to a single source of truth like the Snowflake data warehouse. However, data modeling is still challenging and critical for transforming your raw data into any analysis-ready form to get insights.
Data Engineering is typically a software engineering role that focuses deeply on data – namely, data workflows, data pipelines, and the ETL (Extract, Transform, Load) process. What is Data Science? What are the roles and responsibilities of a Data Engineer? What is the need for Data Science?
Cloudera Contributor: Mark Ramsey, PhD ~ Globally Recognized Chief Data Officer. July brings summer vacations, holiday gatherings, and for the first time in two years, the return of the Massachusetts Institute of Technology (MIT) Chief Data Officer symposium as an in-person event. Luke: What is a modern data platform?