Summary: Data processing technologies have dramatically improved in sophistication and raw throughput. Unfortunately, the volume of data being generated continues to double, requiring further advances in platform capabilities to keep up.
Summary: One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. In this episode Ori Rafael shares his experience at Upsolver building scalable stream processing for integrating and analyzing data, and the tradeoffs when coming from a batch-oriented mindset.
The typical pharmaceutical organization faces many challenges that slow down the data team: raw, barely integrated data sets require engineers to perform manual, repetitive, error-prone work to create analyst-ready data sets. Cloud computing has made it much easier to integrate data sets, but that’s only the beginning.
A tutorial on how to use VDK to perform batch data processing. Versatile Data Kit (VDK) is an open-source data ingestion and processing framework designed to simplify data management complexities.
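To give a rough sense of the model (a hedged sketch, not code from the tutorial): a VDK data job is a directory of steps, and a Python step exposes a `run(job_input)` entry point. The destination table and payload fields below are hypothetical.

```python
# A minimal sketch of a VDK Python job step that ingests a few records;
# assumes VDK's convention that each Python step defines run(job_input).
# The destination table and payload fields are hypothetical.
def run(job_input):
    rows = [
        {"user_id": 1, "event": "click"},
        {"user_id": 2, "event": "view"},
    ]
    for row in rows:
        # Queue each record for delivery to the configured ingestion
        # target (e.g., a data lake or warehouse table).
        job_input.send_object_for_ingestion(
            payload=row, destination_table="events"
        )
```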
In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Start trusting your data with Monte Carlo today!
It incorporates elements from several Microsoft products working together, like Power BI, Azure Synapse Analytics, Data Factory, and OneLake, into a single SaaS experience. No matter the workload, Fabric stores all data on OneLake, a single, unified data lake built on the Delta Lake model.
Snowflake is now making it even easier for customers to bring the platform’s usability, performance, governance and many workloads to more data with Iceberg tables (now generally available), unlocking full storage interoperability. Iceberg tables provide compute engine interoperability over a single copy of data.
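For orientation, a Snowflake-managed Iceberg table is created with SQL; the sketch below drives it from Python with the snowflake-connector-python package. The account, external volume, and table names are placeholders, not details from the announcement.

```python
# A minimal sketch, assuming snowflake-connector-python and a
# pre-configured external volume; all identifiers are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***"
)
conn.cursor().execute(
    """
    CREATE ICEBERG TABLE orders (id INT, amount NUMBER(10, 2))
      CATALOG = 'SNOWFLAKE'          -- Snowflake-managed Iceberg catalog
      EXTERNAL_VOLUME = 'my_s3_vol'  -- where data/metadata files land
      BASE_LOCATION = 'orders/'
    """
)
```

Because the table is stored in the open Iceberg format on the external volume, other compute engines can read the same copy of the data.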
Summary: Streaming data processing enables new categories of data products and analytics. Unfortunately, reasoning about stream processing engines is complex and lacks sufficient tooling. Data lakes are notoriously complex.
Note: Cloud data warehouses like Snowflake and BigQuery already have a default time travel feature. However, this feature becomes an absolute must-have if you are operating your analytics on top of your data lake or lakehouse. It can also be integrated into major data platforms like Snowflake.
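On a lake or lakehouse, time travel typically comes from the table format rather than the platform. A minimal sketch with the open-source deltalake (delta-rs) package, using a hypothetical local table path:

```python
# Time travel against a Delta table: each commit creates a new version
# that can be read back later. The table path is hypothetical.
from deltalake import DeltaTable

latest = DeltaTable("/data/sales_delta")
print("current version:", latest.version())

# Read the table exactly as it existed at version 0.
snapshot = DeltaTable("/data/sales_delta", version=0)
df = snapshot.to_pandas()
```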
This article looks at the options available for storing and processing big data, which is too large for conventional databases to handle. There are two main options: a data lake and a data warehouse. What is a Data Warehouse? What is a Data Lake?
Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box.
A data lake is essentially a vast digital dumping ground where companies toss all their raw data, structured or not. A modern data stack can be built on top of this storage and processing layer, or on a data lakehouse or data warehouse, to store and process data before it is transformed and sent off for analysis.
Data lakes are useful, flexible data storage repositories that enable many types of data to be stored in their rawest state. Traditionally, after being stored in a data lake, raw data was then often moved to various destinations like a data warehouse for further processing, analysis, and consumption.
Summary: The customer data platform is a category of services that was developed early in the evolution of the current era of cloud services for data processing. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.
Learn how we build data lake infrastructures and help organizations all around the world achieve their data goals. In today's data-driven world, organizations face the challenge of managing and processing large volumes of data efficiently.
“Data Lake vs Data Warehouse = Load First, Think Later vs Think First, Load Later.” The terms data lake and data warehouse come up frequently when it comes to storing large volumes of data. What is a Data Warehouse? What is a Data Lake?
The Cloudera platform delivers a one-stop shop that allows you to store any kind of data, process and analyze it in many different ways in a single environment, and integrate with the rest of your data infrastructure. But working with cloud storage has often been a compromise. Read about the ADLS Gen2 announcement on Azure.com.
I finally found a good critique that discusses its flaws, such as multi-hop architecture, inefficiencies, high costs, and difficulties maintaining data quality and reusability. The article advocates for a "shift left" approach to data processing, improving data accessibility, quality, and efficiency for operational and analytical use cases.
In 2010, a transformative concept took root in the realm of data storage and analytics: the data lake. The term was coined by James Dixon, Back-End Java, Data, and Business Intelligence Engineer, and it started a new era in how organizations could store, manage, and analyze their data. What is a data lake?
That’s why it’s essential for teams to choose the right architecture for the storage layer of their data stack. But the options for data storage are evolving quickly. Different vendors offering data warehouses, data lakes, and now data lakehouses all offer their own distinct advantages and disadvantages for data teams to consider.
One such tool is the Versatile Data Kit (VDK), which offers a comprehensive solution for controlling your data versioning needs. VDK helps you easily perform complex operations, such as data ingestion and processing from different sources, using SQL or Python. Use VDK to build a data lake and merge multiple sources.
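To make the "merge multiple sources" idea concrete, here is a hedged sketch of a VDK Python step that unions two source tables via `job_input.execute_query`; the table names and schema are invented for illustration.

```python
# A VDK Python step that merges two hypothetical source tables into one
# analyst-ready table; assumes the job's database connection is configured.
def run(job_input):
    job_input.execute_query(
        """
        CREATE TABLE IF NOT EXISTS merged_events AS
        SELECT id, ts, payload FROM source_a
        UNION ALL
        SELECT id, ts, payload FROM source_b
        """
    )
```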
Cloudera customers run some of the biggest data lakes on earth. These lakes power mission-critical, large-scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes.
Fluss is a compelling new project in the realm of real-time data processing. Fluss uses the lakehouse as tiered storage: data is periodically converted and tiered into data lakes, and Fluss retains only a small portion of recent data. The fourth difference is the lakehouse architecture.
The Rise of Data Observability. Data observability has become increasingly critical as companies seek greater visibility into their data processes. This growing demand has found a natural synergy with the rise of the data lake. As a result, monitoring data in real time was often an afterthought.
A robust data infrastructure is a must-have to compete in the F1 business. We’ll build a data architecture to support our racing team starting from the three canonical layers: Data Lake, Data Warehouse, and Data Mart (in alphabetical order: Apache Airflow, Azure Data Factory, DBT, Google DataForm, …).
Cloudera customers run some of the biggest data lakes on earth. These lakes power mission-critical, large-scale data analytics and AI use cases, including enterprise data warehouses. Support for Modern Analytics Workloads: with support for both SQL-based querying and advanced analytics frameworks (e.g.,
Unique automations and optimizations include encryption by default, built-in storage compression and fast access to data even at petabyte scale. Snowflake's flexibility enables businesses to deploy a wide range of architectural patterns, including a data lake, data warehouse, lakehouse or data mesh.
Furthermore, Striim also supports real-time data replication and real-time analytics, which are both crucial for your organization to maintain up-to-date insights. By efficiently handling data ingestion, this component sets the stage for effective data processing and analysis.
The company quickly realized that maintaining 10 years’ worth of production data while enabling real-time data ingestion led to an unscalable situation that would have necessitated a data lake. Data scientists also benefited from a scalable environment to build machine learning models without fear of system crashes.
QuantumBlack: Solving data quality for gen AI applications. Unstructured data processing is a top priority for enterprises that want to harness the power of GenAI. It brings challenges in processing and quality, and what data quality means for unstructured data is a top question for every organization.
This form of architecture can handle data in all forms (structured, semi-structured, unstructured), blending capabilities from data warehouses and data lakes into data lakehouses.
With that in mind (and a bunch of other things), Delta Lake was developed: an open-source data storage framework that implements the lakehouse architecture, and the topic of today’s post. What is Delta Lake? Delta Lake is a storage framework based on the lakehouse paradigm.
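As a quick taste of the paradigm, the open-source deltalake (delta-rs) package can write a table whose transaction log provides the ACID commits the lakehouse design relies on; the path and schema below are hypothetical.

```python
# Writing and reading a Delta table; every write commits a new version
# to the _delta_log, which underpins ACID semantics and time travel.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"id": [1, 2], "amount": [9.99, 4.50]})
write_deltalake("/tmp/orders_delta", df, mode="append")

table = DeltaTable("/tmp/orders_delta")
print(table.version())
print(table.to_pandas())
```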
Ripple's Journey and Challenges with the Legacy System. Our legacy system was once at the forefront of big data processing, but as our operations grew, we faced a tangle of complexities: high maintenance costs and a system that struggled to meet the real-time demands of our data-driven initiatives.
Think of it as the “slow and steady wins the race” approach to data processing. Stream Processing Pattern: now, imagine if instead of waiting to do laundry once a week, you had a magical washing machine that could clean each piece of clothing the moment it got dirty.
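A toy sketch of the laundry analogy in code, contrasting a scheduled batch run with record-at-a-time streaming; the records and the "cleaning" step are invented for illustration.

```python
records = [{"item": i, "dirty": True} for i in range(5)]

# Batch pattern: accumulate everything, then process in one scheduled
# run ("laundry once a week").
def run_batch(batch):
    return [{**r, "dirty": False} for r in batch]

weekly_result = run_batch(records)

# Stream pattern: handle each record the moment it arrives
# ("wash each item as soon as it gets dirty").
def stream_clean(source):
    for record in source:
        yield {**record, "dirty": False}  # low latency per record

for cleaned in stream_clean(records):
    print(cleaned)
```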
Some years ago he wrote three articles defining the data engineering field. Some concepts: when doing data engineering you touch a lot of different concepts. You'll be seen as the most technical person on the data team, and you'll need to help your team with the "low-level" stuff.
Change Data Capture (CDC) has emerged as an ideal solution for near real-time movement of data from relational databases (like SQL Server or Oracle) to data warehouses, data lakes or other databases. Data can be extracted using database queries (batch-based) or Change Data Capture (near real-time).
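A hedged sketch of the near-real-time side: consuming Debezium-style change events from Kafka with the kafka-python package and routing them to a lake writer. The topic name, event shape, and writer functions are assumptions, not details from the article.

```python
import json
from kafka import KafkaConsumer  # kafka-python package

def upsert_into_lake(row):   # stand-in for a real lake/warehouse writer
    print("upsert:", row)

def delete_from_lake(row):   # stand-in for a real lake/warehouse writer
    print("delete:", row)

consumer = KafkaConsumer(
    "dbserver1.inventory.customers",       # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v),
)

for message in consumer:
    change = message.value
    op = change.get("op")  # Debezium ops: 'c' insert, 'u' update, 'd' delete
    if op in ("c", "u"):
        upsert_into_lake(change["after"])
    elif op == "d":
        delete_from_lake(change["before"])
```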
In addition, AI data engineers should be familiar with programming languages such as Python, Java, Scala, and more for data pipeline, data lineage, and AI model development. Data Storage Solutions: as we all know, data can be stored in a variety of ways.
Data infrastructure that makes light work of complex tasks. Built as a connected application from day one, the anecdotes Compliance OS uses the Snowflake Data Cloud for data ingestion and modeling, including a single cybersecurity data lake where all data can be analyzed within Snowflake.
Carrefour Spain, a branch of the larger company (with 1,250 stores), processes over 3 million transactions every day, giving rise to challenges like creating and managing a data lake and honing key demographic information. The firm also worked on creating a solid pipeline from the data warehouse to the data lake.
In modern hybrid environments, data traverses clouds, on-premise infrastructure and IoT networks, so the process can get very complex. It requires rethinking the data lifecycle itself. If the data goes into a data lake before analysis, extracting it can get pretty complex and time-consuming.
AWS Glue is a widely used serverless data integration service that uses automated extract, transform, and load (ETL) methods to prepare data for analysis. It offers a simple and efficient solution for data processing in organizations. Then, Glue writes the job's metadata into the embedded AWS Glue Data Catalog.
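For orientation, a Glue ETL job is typically a PySpark script built on the awsglue library; the sketch below reads a catalogued table and writes Parquet to S3. The database, table, and bucket names are placeholders.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog...
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)
# ...and write it back out to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/clean/orders/"},
    format="parquet",
)
job.commit()
```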
The Advanced Analytics team supporting the businesses of Merck KGaA, Darmstadt, Germany was able to establish a data governance framework within its enterprise data lake. This enabled Merck KGaA to control and maintain secure data access, and greatly increase business agility for multiple users.
Data Governance & Ethics: Understand emerging data regulations and ethical frameworks that shape how organizations collect, store, and use data. Why Gartner's Data & Analytics Summit Matters: In a world where real-time insights and advanced analytics can make or break an enterprise, staying ahead of the curve is crucial.
Being a hybrid role, data engineering requires technical as well as business skills. Data engineers build scalable data processing pipelines and provide analytical insights to business users. A data engineer also designs, builds, integrates, and manages large-scale data processing systems.