In this era of constant technical growth, big data is at its peak, and with it comes the need for a tool to import and export data between RDBMS and Hadoop. Apache Sqoop stands for "SQL to Hadoop" and is one such tool, transferring data between Hadoop (Hive, HBase, HDFS, etc.) and relational databases.
Hadoop and Spark are the two most popular platforms for Big Data processing. To come to the right decision, we need to divide this big question into several smaller ones — namely: What is Hadoop?
For organizations considering moving from a legacy data warehouse to Snowflake, looking to learn more about how the AI Data Cloud can support legacy Hadoop use cases, or assessing new options if your current cloud data warehouse just isn’t scaling anymore, it helps to see how others have done it.
dbt Core is an open-source framework that helps you organize data warehouse SQL transformations. dbt was born out of the observation that more and more companies were switching from on-premise Hadoop data infrastructure to cloud data warehouses. tests — a way to define SQL tests either at the column level or with a standalone query.
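As a sketch of that second flavor, here is what a standalone ("singular") test might look like; the orders model and amount column are hypothetical, not from the original post:

```sql
-- A dbt singular test, saved as tests/assert_no_negative_amounts.sql.
-- dbt runs the query; any rows returned count as test failures.
-- The orders model and amount column are hypothetical.
select
    order_id,
    amount
from {{ ref('orders') }}
where amount < 0
```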
As Uber’s operations became more complex and we offered additional features and … The post Engineering SQL Support on Apache Pinot at Uber appeared first on Uber Engineering Blog.
Two of the more painful things in your everyday life as an analyst or SQL worker are not getting easy access to data when you need it, and not having easy-to-use, useful tools that don't get in your way. This simple statement captures the essence of almost 10 years of SQL development with modern data warehousing.
Prior to the introduction of CDP Public Cloud, many organizations that wanted to leverage CDH, HDP, or any other on-prem Hadoop runtime in the public cloud had to deploy the platform in a lift-and-shift fashion, commonly known as "Hadoop-on-IaaS" or simply the IaaS model.
Your host is Tobias Macey, and today I'm interviewing Martin Traverso about PrestoSQL, a distributed SQL engine that queries data in place. How did you get involved in the area of data management? Can you start by giving an overview of what Presto is and its origin story?
Materialize's PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support.
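As a minimal sketch of what that looks like in practice (the table and column names below are hypothetical), an incrementally maintained view in Materialize is defined with ordinary SQL:

```sql
-- Materialize keeps this view incrementally up to date as new rows
-- arrive; the definition is plain SQL. Table/columns are hypothetical.
CREATE MATERIALIZED VIEW revenue_by_region AS
SELECT region, sum(amount) AS total_revenue
FROM orders
GROUP BY region;

-- Queried like any PostgreSQL relation:
SELECT * FROM revenue_by_region;
```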
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Your first 30 days are free!
Hadoop initially led the way with Big Data and distributed computing on-premise, before the field finally landed on the Modern Data Stack — in the cloud — with a data warehouse at the center. To understand today's data engineering, I think it is important to at least know Hadoop concepts and context, along with computer science basics.
Pig has SQL-like syntax, which makes it easier for SQL developers to get on board. Spark also has rich Spark SQL APIs for SQL-savvy developers; it covers most of the SQL functions and adds more with each new release. Apache Spark can run in standalone mode using the default scheduler.
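For a flavor of those SQL functions, here is a hedged Spark SQL sketch using a window function; the events table and its columns are hypothetical:

```sql
-- Spark SQL supports standard aggregate and window functions.
-- Hypothetical events table: (user_id, event_time, page).
SELECT user_id,
       event_time,
       row_number() OVER (PARTITION BY user_id ORDER BY event_time) AS visit_seq
FROM events
WHERE event_time >= date_sub(current_date(), 7);
```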
It supports a ton of connectors — from SQL databases to machine learning models — so if you're juggling different tools and platforms, this one can help bring everything together. Apache Atlas is more enterprise-focused and really shines if you're in a Hadoop-heavy environment. It's simple, but it works.
Ozone natively provides Amazon S3- and Hadoop Filesystem-compatible endpoints in addition to its own native object store API endpoint, and is designed to work seamlessly with enterprise-scale data warehousing, machine learning, and streaming workloads. One common pattern is using Spark SQL to access a Hive table stored in Ozone.
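A minimal sketch of that pattern follows; the vehicles table, the ofs:// service id (ozone1), and the volume/bucket names (vol1/bucket1) are hypothetical and depend on your cluster:

```sql
-- Create a Hive table whose data lives in Ozone, then query it from
-- Spark SQL. Service id, volume, bucket, and table are hypothetical.
CREATE TABLE vehicles (id INT, name STRING)
STORED AS TEXTFILE
LOCATION 'ofs://ozone1/vol1/bucket1/vehicles';

SELECT * FROM vehicles LIMIT 10;
```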
That's where Hadoop comes into the picture. Hadoop is a popular open-source framework that stores and processes large datasets in a distributed manner. Organizations are increasingly interested in Hadoop to gain insights and a competitive advantage from their massive datasets. Why Are Hadoop Projects So Important?
The tools and techniques are proven, the SQL query language is well known, and there's plenty of expertise available to keep EDWs humming. Enter Hadoop, which lets you store data on a massive scale at low cost (compared with similarly scaled commercial databases).
Striim offers an out-of-the-box adapter for Snowflake to stream real-time data from enterprise databases (using low-impact change data capture ), log files from security devices and other systems, IoT sensors and devices, messaging systems, and Hadoop solutions, and provide in-flight transformation capabilities.
Summary: The Hadoop platform is purpose-built for processing large, slow-moving data in long-running batch jobs. For a perfect pairing, they made it easy to connect to the Impala SQL engine. How does it fit into the Hadoop ecosystem? Can you start by explaining what Kudu is and the motivation for building it?
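To give a flavor of that Kudu–Impala pairing, here is a hedged sketch of creating a Kudu-backed table through Impala SQL; the schema and partition count are illustrative, not from the episode:

```sql
-- Impala DDL for a Kudu-backed table: Kudu stores the data,
-- Impala provides the SQL engine. Schema is hypothetical.
CREATE TABLE metrics (
  host STRING,
  ts TIMESTAMP,
  value DOUBLE,
  PRIMARY KEY (host, ts)
)
PARTITION BY HASH (host) PARTITIONS 16
STORED AS KUDU;
```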
To establish a career in big data, you need to be knowledgeable about some concepts, Hadoop being one of them. Hadoop tools are frameworks that help to process massive amounts of data and perform computation. You can learn in detail about Hadoop tools and technologies through a Big Data and Hadoop training online course.
Spark offers over 80 high-level operators that make it easy to build parallel apps, and one can use it interactively from the Scala, Python, R, and SQL shells. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. Basic knowledge of SQL is assumed.
It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. For the package type, choose 'Pre-built for Apache Hadoop'.
Ten years ago, this data cluster was 300GB as a Hadoop cluster; that’s around a 100,000-fold increase in data stored! For transactional databases, it’s mostly the Microsoft SQL Server, but also other databases like PostgreSQL, ScyllaDB and Couchbase. The company runs 4 data centers: in the US and Europe, with two in Asia.
Good old data warehouses like Oracle were engine + storage; then Hadoop arrived and was almost the same: you had an engine (MapReduce, Pig, Hive, Spark) and HDFS, everything in the same cluster, with data co-location. You could write the same pipeline in Java, in Scala, in Python, in SQL, etc.
Spark has long allowed running SQL queries on a remote Thrift JDBC server. The appropriate Spark dependencies (spark-core/spark-sql or spark-connect-client-jvm) will be provided later in the Java classpath, depending on the run mode (plus hadoop-aws, since we almost always interact with S3 storage on the client side).
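Once the Thrift server is up, clients speak the HiveServer2 protocol (typically a jdbc:hive2://host:10000 URL) and simply submit SQL; a hedged sketch of the kind of statement you'd send, with a hypothetical logs table:

```sql
-- Submitted through the Thrift JDBC endpoint exactly like any
-- HiveServer2 query; logs is a hypothetical table.
SELECT level, count(*) AS n
FROM logs
GROUP BY level
ORDER BY n DESC;
```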
This means many manually implemented Ranger HDFS policies, Hadoop ACLs, or POSIX permissions created solely for this purpose can now be removed, if desired. Instead, it generates a mapping that allows the Ranger plugin in HDFS to make run-time decisions based on the Hadoop SQL grants.
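For illustration, a Hadoop SQL grant of the kind that gets mapped down to HDFS access might look like the following Hive-style statement; the database, table, and user are hypothetical:

```sql
-- A grant managed under Ranger's Hadoop SQL policies;
-- sales.transactions and etl_user are hypothetical names.
GRANT SELECT ON TABLE sales.transactions TO USER etl_user;
```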
Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. For someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries?
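One common consideration is filtering on partition columns so Presto can prune partitions instead of scanning the whole table; a hedged sketch, where the catalog, schema, and ds partition column are hypothetical:

```sql
-- The predicate on ds (a partition column) lets Presto skip
-- irrelevant partitions entirely. All names are hypothetical.
SELECT user_id, count(*) AS views
FROM hive.web.page_views
WHERE ds = '2024-01-01'
GROUP BY user_id
ORDER BY views DESC
LIMIT 100;
```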
If you pursue the MSc big data technologies course, you will be able to specialize in topics such as Big Data Analytics, Business Analytics, Machine Learning, Hadoop and Spark technologies, Cloud Systems etc. There are a variety of big data processing technologies available, including Apache Hadoop, Apache Spark, and MongoDB.
Apache Ozone is a distributed object store built on top of the Hadoop Distributed Data Store (HDDS) service. In Ozone, the HDDS layer, including SCM and Datanodes, provides generic replication of containers/blocks without namespace metadata. var/lib/hadoop-ozone/scm/ozone-metadata/scm/(key|certs).
We recently embarked on a significant data platform migration, transitioning from Hadoop to Databricks — a move motivated by our relentless pursuit of excellence and our contributions to the XRP Ledger's (XRPL) data analytics, as well as by high maintenance costs and a system that struggled to meet the real-time demands of our data-driven initiatives.
At the heart of these data engineering skills lies SQL, which helps data engineers manage and manipulate large amounts of data. Did you know SQL is the top skill listed in 73.4% Almost all major tech organizations use SQL. According to the 2022 developer survey by Stack Overflow, Python is surpassed by SQL in popularity.
This blog post provides CDH users with a quick overview of Ranger as a Sentry replacement for HadoopSQL policies in CDP. Apache Sentry is a role-based authorization module for specific components in Hadoop. It is useful in defining and enforcing different levels of privileges on data for users on a Hadoop cluster.
In the early days, many companies simply used Apache Kafka® for data ingestion into Hadoop or another data lake. Rockset supports JDBC and integrates with other SQL dashboards like Tableau, Grafana, and Apache Superset. However, Apache Kafka is more than just messaging. In the most critical use cases, every second counts.
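For example, a dashboard might issue an aggregation like the following over Rockset's JDBC interface; the commons.orders collection and its fields are hypothetical:

```sql
-- SQL a dashboard could run against a Rockset collection fed by
-- Kafka; commons.orders and its fields are hypothetical.
SELECT status, COUNT(*) AS order_count
FROM commons.orders
WHERE _event_time > CURRENT_TIMESTAMP() - INTERVAL 1 HOUR
GROUP BY status;
```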
This job requires a handful of skills, starting from a strong foundation of SQL and programming languages like Python, Java, etc. Most data engineers working in the field enroll in additional training programs to learn an outside skill, such as Hadoop or Big Data querying, alongside their master's degrees and PhDs.
One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Despite being older than the Hadoop platform, it doesn't seem that HPCC Systems has seen the same level of growth and popularity.
Most Popular Programming Certifications C & C++ Certifications Oracle Certified Associate Java Programmer OCAJP Certified Associate in Python Programming (PCAP) MongoDB Certified Developer Associate Exam R Programming Certification Oracle MySQL Database Administration Training and Certification (CMDBA) CCA Spark and Hadoop Developer 1.
This discipline also integrates specialization around the operation of so-called "big data" distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and computation at scale. This includes tasks like setting up and operating platforms like Hadoop/Hive/HBase, Spark, and the like.
I was in the Hadoop world and all I was doing was denormalisation. The only normalisation I did was back at engineering school while learning SQL with Normal Forms. Under the hood it uses sqlglot, the SQL parser developed by the same developer. Denormalisation everywhere. YAML configured.
Is Timescale compatible with systems such as Amazon RDS or Google Cloud SQL? How is Timescale implemented, and how has the internal architecture evolved since you first started working on it? What impact has the 10.0
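For context on what using Timescale looks like, a regular PostgreSQL table is converted into a hypertable with one function call; the table and column names here are hypothetical:

```sql
-- TimescaleDB extends PostgreSQL; create_hypertable() partitions
-- the table by time under the hood. Schema is hypothetical.
CREATE TABLE conditions (
  time        TIMESTAMPTZ NOT NULL,
  device_id   TEXT,
  temperature DOUBLE PRECISION
);

SELECT create_hypertable('conditions', 'time');
```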
Limitations of NoSQL: SQL supports complex queries because it is a very expressive, mature language. Complex SQL queries have long been commonplace in business intelligence (BI), and when systems such as Hadoop and Hive arrived, they married complex queries with big data for the first time. This is intentionally not NoSQL's forte.
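The kind of query meant here is a multi-way join with aggregation — bread and butter in SQL, but awkward in most NoSQL stores; a hedged sketch with hypothetical tables:

```sql
-- A typical BI-style query: join, filter, aggregate, sort.
-- The orders and customers tables are hypothetical.
SELECT c.region,
       date_trunc('month', o.created_at) AS month,
       sum(o.amount) AS revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at >= DATE '2024-01-01'
GROUP BY c.region, date_trunc('month', o.created_at)
ORDER BY month, revenue DESC;
```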
Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL.
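The query you paste is just ordinary warehouse SQL selecting the rows and columns to sync; a hedged sketch, with a hypothetical users table:

```sql
-- Selects the records to sync to a SaaS tool; each column is then
-- mapped to a destination field in the visual mapper.
SELECT user_id,
       email,
       plan,
       last_seen_at
FROM analytics.users
WHERE email IS NOT NULL;
```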