Introduction: In this constantly growing technical era, big data is at its peak, and there is a need for a tool to import and export data between RDBMS and Hadoop. Apache Sqoop, short for "SQL to Hadoop," is one such tool: it transfers data between Hadoop (Hive, HBase, HDFS, etc.) and relational databases.
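For a feel of what that looks like in practice, here is a minimal sketch of a Sqoop import driven from Python; the JDBC URL, table name, credentials, and paths are hypothetical placeholders, not values from the article:

```python
import subprocess

# A minimal sketch of a Sqoop import: copy a hypothetical "orders" table
# from MySQL into HDFS. All connection details below are placeholders.
subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://db.example.com/sales",  # source RDBMS
        "--username", "etl_user",
        "--password-file", "/user/etl/.password",          # avoid plain-text passwords
        "--table", "orders",                               # table to import
        "--target-dir", "/data/raw/orders",                # HDFS destination
        "--num-mappers", "4",                              # parallel map tasks
    ],
    check=True,
)
```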
Many of our customers — from Marriott to AT&T — start their journey with the Snowflake AI Data Cloud by migrating their data warehousing workloads to the platform. Today we're focusing on customers who migrated from a cloud data warehouse to Snowflake and some of the benefits they saw, including annual cost savings in the millions.
Hadoop and Spark are the two most popular platforms for Big Data processing. They both enable you to deal with huge collections of data no matter the format — from Excel tables to user feedback on websites to images and video files. What is Hadoop, what are its limitations, and how does the Hadoop ecosystem address them?
dbt Core is an open-source framework that helps you organise SQL transformations in your data warehouse. dbt was born out of the observation that more and more companies were switching from on-premise Hadoop data infrastructure to cloud data warehouses, a switch led by the modern data stack vision.
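As a small illustration, dbt Core (1.5 and later) can be invoked programmatically from Python as well as from the CLI; the `--select staging` selector below is a hypothetical example and assumes you run inside an existing dbt project:

```python
# Programmatic invocation of dbt Core (available in dbt-core >= 1.5).
# Assumes the working directory is an existing dbt project.
from dbt.cli.main import dbtRunner

result = dbtRunner().invoke(["run", "--select", "staging"])  # build only "staging" models
if not result.success:
    raise SystemExit("dbt run failed")
```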
Uber leverages real-time analytics on aggregate data to improve the user experience across our products, from fighting fraudulent behavior on Uber Eats to forecasting demand on our platform.
Two of the more painful things in your everyday life as an analyst or SQL worker are not getting easy access to data when you need it and not having easy-to-use, useful tools that don't get in your way. This simple statement captures the essence of almost 10 years of SQL development with modern data warehousing.
In that time there have been a number of generational shifts in how data engineering is done. Parting question: from your perspective, what is the biggest gap in the tooling or technology for data management today? Materialize: looking for the simplest way to get the freshest data possible to your teams?
Different teams love using the same data in totally different ways. That's where data dictionary tools come in. A data dictionary tool helps define and organize your data so everyone's speaking the same language.
Summary: A core differentiator of Dagster in the ecosystem of data orchestration is its focus on software-defined assets as a means of building declarative workflows. Can you describe what the focus of Dagster+ is and the story behind it?
For analytical use cases you often want to combine data across multiple sources and storage locations, which frequently requires cumbersome and time-consuming data integration. If you hand a book to a new data engineer, what wisdom would you add to it?
A substantial amount of the data being managed in these systems is related to customers and their interactions with an organization. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Learn data engineering: all the references. This is a special edition of the Data News, but right now I'm on holiday, finishing a hiking week in Corsica 🥾. So I wrote this special edition about how to learn data engineering in 2024. The idea is to create a living reference about data engineering.
Prior to the introduction of CDP Public Cloud, many organizations that wanted to leverage CDH, HDP, or any other on-prem Hadoop runtime in the public cloud had to deploy the platform in a lift-and-shift fashion, commonly known as "Hadoop-on-IaaS" or simply the IaaS model.
The Biggest Data Science Blogathon is now live! Analytics Vidhya is back with the largest data-sharing knowledge competition: the Data Science Blogathon. "Knowledge is power. Sharing knowledge is the key to unlocking that power."
Ozone natively provides Amazon S3- and Hadoop Filesystem-compatible endpoints in addition to its own native object store API endpoint, and is designed to work seamlessly with enterprise-scale data warehousing, machine learning, and streaming workloads. Data can be ingested through the S3 interface, for example with boto3.
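A minimal sketch of that S3-style ingestion, assuming an Ozone S3 Gateway reachable at a placeholder endpoint (9878 is the gateway's default port); the bucket, keys, and credentials are illustrative:

```python
import boto3

# Point boto3 at Ozone's S3-compatible gateway instead of AWS.
# Endpoint, credentials, and bucket below are placeholders for your cluster.
s3 = boto3.client(
    "s3",
    endpoint_url="http://ozone-s3g.example.com:9878",  # Ozone S3 Gateway
    aws_access_key_id="testuser",
    aws_secret_access_key="secret",
)

s3.create_bucket(Bucket="analytics")
s3.upload_file("events.parquet", "analytics", "raw/events.parquet")
print(s3.list_objects_v2(Bucket="analytics")["KeyCount"])
```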
Summary: The Hadoop platform is purpose-built for processing large, slow-moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast analytics on fast-moving data. For a perfect pairing, they made it easy to connect to the Impala SQL engine.
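For context, connecting to Impala from Python is commonly done with the impyla package; a minimal sketch, where the host and query are placeholders and 21050 is Impala's default client port:

```python
# Querying Impala from Python with impyla (pip install impyla).
# Host and table names below are placeholders.
from impala.dbapi import connect

conn = connect(host="impala.example.com", port=21050)
cur = conn.cursor()
cur.execute("SELECT event_date, COUNT(*) FROM events GROUP BY event_date")
for row in cur.fetchall():
    print(row)
conn.close()
```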
Data News entering town. Hey you, if I wasn't late in my newsletter writing, it wouldn't be me. But here is your usual Data News. Data modeling: dear readers, I have to confess something. I did not care about data modeling for years. I was in the Hadoop world, and all I was doing was denormalisation.
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms.
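Apache Iceberg is one widely used open table format; as a hedged illustration, reading an Iceberg table with the pyiceberg package might look like the sketch below, where the catalog endpoint and table identifier are placeholders:

```python
# A minimal Iceberg read with pyiceberg (pip install "pyiceberg[pyarrow]").
# The catalog endpoint and table identifier are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "demo",
    **{"uri": "http://rest-catalog.example.com:8181"},  # a REST catalog endpoint
)
table = catalog.load_table("analytics.events")
print(table.scan(limit=10).to_arrow())  # engine-agnostic read via Arrow
```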
By the time I left in 2013, I was a data engineer. We were data engineers! Data science as a discipline was going through its adolescence of self-affirmation and self-definition. At the same time, data engineering was the slightly younger sibling, and it was going through something similar.
The rise of AI and GenAI has given rise to new questions in the data ecosystem, and to new roles. One job that has become increasingly popular across enterprise data teams is the role of the AI data engineer, and demand for AI data engineers has grown rapidly in data-driven organizations.
Summary: Data lake architectures have largely been biased toward batch processing workflows due to the volume of data they are designed for. With more real-time requirements and the increasing use of streaming data, there has been a struggle to merge fast, incremental updates with large, historical analysis.
The enterprise data warehouse (EDW) is the backbone of analytics and business intelligence for most large organizations and many midsize firms. The tools and techniques are proven, the SQL query language is well known, and there’s plenty of expertise available to keep EDWs humming.
Summary: Most businesses end up with data in a myriad of places with varying levels of structure. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Can you start by explaining what Presto is?
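To make that concrete, here is a minimal sketch using the presto-python-client package, with placeholder host, catalogs, and tables; note how a single query can join data from two different connectors without any prior consolidation:

```python
# Querying Presto from Python (pip install presto-python-client).
# Host, catalogs, schemas, and tables below are placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
# One query federates two connectors: a Hive table and a MySQL table.
cur.execute(
    "SELECT o.id, c.name FROM hive.sales.orders o "
    "JOIN mysql.crm.customers c ON o.customer_id = c.id LIMIT 10"
)
print(cur.fetchall())
```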
Introduction: Data engineering is the field of study that deals with the design, construction, deployment, and maintenance of data processing systems. The goal of this domain is to collect, store, and process data effectively and efficiently so that it can be used to support business decisions and power data-driven applications.
Introduction: Embracing the Future with Ripple's Data Platform Migration Welcome to a pivotal moment in Ripple's data journey. As leaders at the intersection of blockchain technology and financial services, we're excited to share a transformative step in our data management evolution.
release, how the use cases for time series data have proliferated, and how they are continuing to simplify the task of processing your time-oriented events. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform.
Big data in information technology is used to improve operations, provide better customer service, develop customized marketing campaigns, and take other actions to increase revenue and profits. It is especially true in the world of big data. What Are Big Data Technologies?
Spark has long allowed users to run SQL queries on a remote Thrift JDBC server. It can negatively affect data readiness time and user experience. Based on this data, the service automatically determines, for each application, whether it should run on the Spark Connect server or as a separate Spark application.
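For reference, attaching a PySpark client (3.4+) to a shared Spark Connect server looks roughly like this; the server URL and query are placeholders (15002 is the default Spark Connect port):

```python
# Connecting to a shared Spark Connect server from PySpark 3.4+
# (pip install "pyspark[connect]"). URL and table are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-connect.example.com:15002").getOrCreate()
df = spark.sql("SELECT country, COUNT(*) AS users FROM users GROUP BY country")
df.show()
```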
Why We Need Big Data Frameworks: Big data is primarily defined by the volume of a data set. Big data sets are generally huge, measuring tens of terabytes and sometimes crossing the threshold of petabytes. It is surprising to know how much data is generated every minute. As estimated by DOMO: over 2.5
In a previous two-part series, we dived into Uber's multi-year project to move onto the cloud, away from operating its own data centers: the number of developers, physical cores, data centers, and more. The cloud or your own data centers? To get articles like this every week, subscribe here.
Summary: Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system.
Data Engineering is typically a software engineering role that focuses deeply on data – namely, data workflows, data pipelines, and the ETL (Extract, Transform, Load) process. What is Data Science? What are the roles and responsibilities of a Data Engineer? What is the need for Data Science?
Mastodon and Hadoop are on a boat. On a social note, today I've joined the data-folks Mastodon server; you can follow me there. I'll speak about "How to build the data dream team." Let's jump into the news. Ingredients of a Data Warehouse: going back to basics. I mainly work 3 to 4 days a week.
Summary: One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. With the improvements in streaming engines, it is now possible to perform all of your data integration in near real time, but it can be challenging to understand the proper processing patterns to make that performant.
This year, the Snowflake Summit was held in San Francisco from June 2 to 5, while the Databricks Data+AI Summit took place 5 days later, from June 10 to 13, also in San Francisco. A quick semantic analysis of "The" suggests that both want to be THE platform you need when you're doing data.
With instant elasticity, high performance, and secure data sharing across multiple clouds, Snowflake has become highly in demand for its cloud-based data warehouse offering. As organizations adopt Snowflake for business-critical workloads, they also need to look for a modern data integration approach.
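As a small, hedged sketch, querying Snowflake from Python with the official connector; the account identifier, credentials, and object names below are placeholders:

```python
# Querying Snowflake with the official Python connector
# (pip install snowflake-connector-python). All identifiers are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="myorg-myaccount",
    user="ETL_USER",
    password="***",
    warehouse="ANALYTICS_WH",
    database="RAW",
    schema="PUBLIC",
)
cur = conn.cursor()
cur.execute("SELECT CURRENT_VERSION()")
print(cur.fetchone())
conn.close()
```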
In the present-day world, almost all industries are generating humongous amounts of data, which are crucial for the future decisions an organization has to make. This massive amount of data is referred to as "big data": large volumes of structured and unstructured data that have to be processed.
Summary Google pioneered an impressive number of the architectural underpinnings of the broader big data ecosystem. In this episode Lak Lakshmanan enumerates the variety of services that are available for building your various data processing and analytical systems. No more scripts, just SQL.
Imagine having a framework capable of handling large amounts of data with reliability, scalability, and cost-effectiveness. That's where Hadoop comes into the picture. Hadoop is a popular open-source framework that stores and processes large datasets in a distributed manner. Why Are Hadoop Projects So Important?
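To illustrate the programming model, here is the classic word-count mapper for Hadoop Streaming: Hadoop pipes input splits to stdin and collects tab-separated key/value pairs from stdout. A matching reducer would sum the counts per word, and both scripts would be passed to the hadoop-streaming JAR; the file names here are illustrative.

```python
#!/usr/bin/env python3
# mapper.py: Hadoop Streaming word-count mapper.
# Reads raw text lines from stdin, emits "word<TAB>1" pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```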
Summary: This has been an active year for the data ecosystem, with a number of new product categories and substantial growth in existing areas. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
The demand for skilled data engineers who can build, maintain, and optimize large data infrastructures shows no signs of slowing down. At the heart of these data engineering skills lies SQL, which helps data engineers manage and manipulate large amounts of data. How many data engineer job postings on Indeed mention SQL?
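As a tiny, self-contained illustration of that day-to-day SQL work (using Python's built-in sqlite3 so it runs anywhere; the table and data are made up):

```python
# Aggregate and filter: the bread-and-butter SQL a data engineer writes daily.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 15.5), ("alice", 12.0)],
)
for row in conn.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer HAVING total > 20 ORDER BY total DESC"
):
    print(row)  # ('alice', 42.0)
```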
Summary: With the wealth of formats for sending and storing data, it can be difficult to determine which one to use. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.
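A minimal sketch of Arrow's in-memory sharing with the pyarrow package (pandas is also required for the round trip shown); the column names and values are illustrative:

```python
# Sharing a columnar table between Arrow and pandas without a serialization step.
import pyarrow as pa

table = pa.table({"user_id": [1, 2, 3], "score": [0.9, 0.4, 0.7]})
print(table.schema)
df = table.to_pandas()            # hand the data to pandas
back = pa.Table.from_pandas(df)   # and back again
print(back.num_rows)
```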
Data analytics, data mining, artificial intelligence, machine learning, deep learning, and other related matters are all included under the collective term "data science." Data science is one of the fastest-growing industries in terms of income potential and career opportunities.