Introduction The Hadoop Distributed File System (HDFS) is a Java-based file system that is distributed, scalable, and portable. HDFS and […] The post Top 10 Hadoop Interview Questions You Must Know appeared first on Analytics Vidhya. Because of its lack of POSIX conformance, some consider it a data store rather than a true file system.
Big data is nothing but vast volumes of datasets, measured in terabytes, petabytes, or even more. Big data […] The post A Beginner’s Guide to the Basics of Big Data and Hadoop appeared first on Analytics Vidhya.
As Uber’s business grew, we scaled our Apache Hadoop (referred to as ‘Hadoop’ in this article) deployment to 21,000+ hosts in 5 years to support the various analytical and machine learning use cases.
In this episode of Unapologetically Technical, I interview Adrian Woodhead, a distinguished software engineer at Human and a true trailblazer in the European Hadoop ecosystem. Don’t forget to subscribe to my YouTube channel to get the latest on Unapologetically Technical!
But is it truly revolutionary, or is it destined to repeat the pitfalls of past solutions like Hadoop? Danny authored a thought-provoking article comparing Iceberg to Hadoop, not on a purely technical level, but in terms of their hype cycles, implementation challenges, and the surrounding ecosystems.
Then came Big Data and Hadoop! The big data boom was born, and Hadoop was its poster child. The promise of Hadoop was that organizations could securely upload and economically distribute massive batch files of any data across a cluster of computers. A data lake!
YARN is a powerful resource management system for a horizontal server environment. It is designed to be more flexible and generic than the original Hadoop MapReduce system, making it an attractive choice for companies looking to implement Hadoop.
Introduction In this constantly growing technical era, big data is at its peak, and there is a need for a tool to import and export data between RDBMS and Hadoop. Apache Sqoop stands for “SQL to Hadoop,” and it is one such tool that transfers data between Hadoop (Hive, HBase, HDFS, etc.) and relational databases.
Introduction Today we have an abundance of Hadoop jobs running constantly, but we can’t schedule these jobs manually; we need some kind of scheduler to handle this flow. Apache Oozie is one such job scheduler that allows users to run, schedule, and manage Hadoop jobs in a distributed environment.
Introduction Microsoft Azure HDInsight (or Microsoft HDFS) is a cloud-based version of the Hadoop Distributed File System. HDInsight works seamlessly with the Hadoop ecosystem, which includes technologies like MapReduce, Hive, […] The post Top 6 Microsoft HDFS Interview Questions appeared first on Analytics Vidhya.
Introduction Big data processing is crucial today. Big data analytics and learning help corporations foresee client demands, provide useful recommendations, and more. Hadoop, the open-source software framework for scalable and distributed computation of massive data sets, makes it easy.
Hadoop and Spark are the two most popular platforms for Big Data processing. To come to the right decision, we need to divide this big question into several smaller ones — namely: What is Hadoop?
We recently containerized Hadoop NameNodes and upgraded hardware, improving NameNode RPC queue time from ~200 ms to ~20 ms, a 10x improvement! With this radical change, Uber’s Hadoop customers are happier and admins rest easier at night.
Introduction HDFS (Hadoop Distributed File System) is not a traditional database but a distributed file system designed to store and process big data. It is a core component of the Apache Hadoop ecosystem and allows for storing and processing large datasets across multiple commodity servers.
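To make the file-system (rather than database) nature of HDFS concrete, here is a minimal Python sketch using pyarrow's HDFS bindings. The NameNode host, port, and path are hypothetical, and it assumes the libhdfs native library is available on the client.

from pyarrow import fs

# Connect to a (hypothetical) NameNode; requires libhdfs on the client.
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# Write a small file, then read it back, exactly as with a local filesystem.
with hdfs.open_output_stream("/tmp/hello.txt") as f:
    f.write(b"hello from hdfs\n")
with hdfs.open_input_stream("/tmp/hello.txt") as f:
    print(f.read())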
For organizations considering moving from a legacy data warehouse to Snowflake, looking to learn more about how the AI Data Cloud can support legacy Hadoop use cases, or assessing new options if your current cloud data warehouse just isn’t scaling anymore, it helps to see how others have done it.
Ready to boost your Hadoop Data Lake security on GCP? Our latest blog dives into enabling security for Uber’s modernized batch data lake on Google Cloud Storage!
Prior to the introduction of CDP Public Cloud, many organizations that wanted to leverage CDH, HDP, or any other on-prem Hadoop runtime in the public cloud had to deploy the platform in a lift-and-shift fashion, commonly known as “Hadoop-on-IaaS” or simply the IaaS model.
So, let's bring Hadoop into play here. Everyone suddenly started talking about Hadoop, and everyone should learn Hadoop. There was a time when people said, "Okay, let's look at Hadoop and become a Hadoop expert."
That's where Hadoop comes into the picture. Hadoop is a popular open-source framework that stores and processes large datasets in a distributed manner. Organizations are increasingly interested in Hadoop to gain insights and a competitive advantage from their massive datasets. Why Are Hadoop Projects So Important?
Enter Hadoop, which lets you store data on a massive scale at low cost (compared with similarly scaled commercial databases). That sounds great, but where do you find qualified people who know how to use Pig, Hive, Sqoop, and the other tools needed to run Hadoop?
Uber stores its data in a combination of Hadoop and Cassandra for high availability and low latency access. When you request a ride, Uber grabs your location and streams it through Kafka to Flink. Flink then gets to work finding the nearest available driver and calculating your fare.
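As a rough sketch of the producer side of that pipeline, the snippet below publishes a rider-location event to Kafka with the kafka-python package. The broker address, topic name, and event fields are illustrative assumptions, not Uber's actual schema.

import json
from kafka import KafkaProducer

# Hypothetical broker and topic; Uber's real topology and schema differ.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A rider-location event, as a downstream Flink job might consume it.
event = {"rider_id": "r-123", "lat": 37.7749, "lon": -122.4194}
producer.send("rider-locations", value=event)
producer.flush()  # Make sure the event actually leaves the client.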
To establish a career in big data, you need to be knowledgeable about some concepts, Hadoop being one of them. Hadoop tools are frameworks that help to process massive amounts of data and perform computation. You can learn in detail about Hadoop tools and technologies through a Big Data and Hadoop training online course.
Hadoop initially led the way with Big Data and distributed computing on-premise to finally land on the Modern Data Stack — in the cloud — with a data warehouse at the center. In order to understand today's data engineering, I think it is important to at least know Hadoop concepts and context, along with computer science basics.
The first time that I really became familiar with this term was at Hadoop World in New York City some ten or so years ago. This was the gold rush of the 21st century, except the gold was data. But let’s make one thing clear: we are no longer that Hadoop company. But what happened to Hadoop?
Ozone natively provides Amazon S3 and Hadoop Filesystem compatible endpoints in addition to its own native object store API endpoint, and is designed to work seamlessly with enterprise-scale data warehousing, machine learning, and streaming workloads. From the accompanying Hive example:
STORED AS TEXTFILE
LOCATION 'ofs://ozone1/s3v/spark-bucket/vaccine-dataset'
In this post, we focus on how we enhanced and extended Monarch, Pinterest’s Hadoop-based batch processing system, with FGAC capabilities. In the next section, we elaborate on how we integrated CVS into Hadoop to provide FGAC capabilities for our Big Data platform. QueryBook uses OAuth to authenticate users.
Apache Ozone is compatible with Amazon S3 and Hadoop FileSystem protocols and provides bucket layouts that are optimized for both Object Store and File system semantics. Bucket layouts provide a single Ozone cluster with the capabilities of both a Hadoop Compatible File System (HCFS) and Object Store (like Amazon S3).
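Because Ozone speaks the S3 protocol, an ordinary S3 client can be pointed at it. Below is a minimal sketch using boto3, assuming an Ozone S3 Gateway at a hypothetical host (9878 is the gateway's default port) and placeholder credentials.

import boto3

# Point a standard S3 client at the Ozone S3 Gateway (hypothetical host;
# 9878 is the gateway's default port). Credentials are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://ozone-s3g.example.com:9878",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="spark-bucket")
s3.put_object(Bucket="spark-bucket", Key="hello.txt", Body=b"hello ozone")
for obj in s3.list_objects_v2(Bucket="spark-bucket").get("Contents", []):
    print(obj["Key"])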
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm reflecting on the major trends in data engineering over the past 6 years. Interview Introduction: 6 years of running the Data Engineering Podcast; around the first time that data engineering was discussed as (..)
Apache Ozone is a distributed object store built on top of the Hadoop Distributed Data Store service. In Ozone, the HDDS (Hadoop Distributed Data Storage) layer, including SCM and Datanodes, provides generic replication of containers/blocks without namespace metadata.
var/lib/hadoop-ozone/scm/ozone-metadata/scm/(key|certs)
First let's understand why dbt exists. dbt was born out of the analysis that more and more companies were switching from on-premise Hadoop data infrastructure to cloud data warehouses, a switch that has been led by the modern data stack vision. In this resource hub I'll mainly focus on dbt Core, i.e. dbt.
If you pursue the MSc big data technologies course, you will be able to specialize in topics such as Big Data Analytics, Business Analytics, Machine Learning, Hadoop and Spark technologies, Cloud Systems etc. There are a variety of big data processing technologies available, including Apache Hadoop, Apache Spark, and MongoDB.
Compatibility: MapReduce is compatible with all data sources and file formats that Hadoop supports. Spark is developed in Scala and can run on Hadoop in standalone mode using its own default resource manager, or in cluster mode using the YARN or Mesos resource managers. Spark is a bit bare at the moment.
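To illustrate those deployment options, the same PySpark application can target local/standalone or YARN by changing only the master setting. This is a sketch under the assumption that, for the YARN case, HADOOP_CONF_DIR points at the cluster configuration.

from pyspark.sql import SparkSession

# Local mode for development; swap "local[*]" for "yarn" to run on a
# Hadoop cluster (assumes HADOOP_CONF_DIR points at the cluster config).
spark = (
    SparkSession.builder
    .appName("master-demo")
    .master("local[*]")
    .getOrCreate()
)
print(spark.range(10).count())  # A trivial job to prove the session works.
spark.stop()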
What is Zeppelin? Zeppelin is a notebook tool, just like Jupyter, and it can run Spark jobs in the background. You can run it on a server, on your Hadoop cluster, or wherever. Working with dataframes and SparkSQL especially is a blast. Advantages of Zeppelin: the nice thing about it is that you have your notebook.
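The dataframe-plus-SparkSQL workflow praised above looks roughly like this in PySpark, whether typed into a Zeppelin paragraph or a plain script; the table data and names are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zeppelin-style-demo").getOrCreate()

# A toy dataframe; in Zeppelin this would live in a notebook paragraph.
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")

# SparkSQL over the registered view.
spark.sql("SELECT name FROM people WHERE age > 30").show()
spark.stop()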
We recently embarked on a significant data platform migration, transitioning from Hadoop to Databricks, a move motivated by our relentless pursuit of excellence and our contributions to the XRP Ledger's (XRPL) data analytics. Hadoop had come to mean high maintenance costs and a system that struggled to meet the real-time demands of our data-driven initiatives.
Prior to 2019, Marriott was an early adopter of Netezza and Hadoop, leveraging the IBM BigInsights platform. Data that previously took 48 hours to one week in Hadoop is now available near-instantly in Snowflake. As Marriott’s business has grown over the past century, its data infrastructure has become more complex.
For the package type, choose ‘Pre-built for Apache Hadoop’. The page will look like the one below. Step 2: Once the download is completed, unzip the file using WinZip, WinRAR, or 7-Zip. Step 6: Spark needs a piece of Hadoop to run (for Hadoop 2.7, …). Add %SPARK_HOME%\bin to the path variable.
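Once the steps above are done, a quick way to confirm the setup from Python is the findspark package (a separate pip install, assumed here) followed by a trivial job:

import findspark
findspark.init()  # Locates Spark via the SPARK_HOME set in the steps above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("install-check").getOrCreate()
spark.range(5).show()  # Prints ids 0-4 if the installation works.
spark.stop()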
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Most Popular Programming Certifications: C & C++ Certifications, Oracle Certified Associate Java Programmer (OCAJP), Certified Associate in Python Programming (PCAP), MongoDB Certified Developer Associate Exam, R Programming Certification, Oracle MySQL Database Administration Training and Certification (CMDBA), CCA Spark and Hadoop Developer 1.
In this blog post, we will look into benchmark test results measuring the performance of Apache Hadoop Teragen and a directory/file rename operation with Apache Ozone (native o3fs) vs. Ozone S3 API*. We ran Apache Hadoop Teragen benchmark tests in a conventional Hadoop stack consisting of YARN and HDFS side by side with Apache Ozone.
It was designed as a native object store to provide extreme scale, performance, and reliability to handle multiple analytics workloads using either the S3 API or the traditional Hadoop API. In this blog post, we will talk about a single Ozone cluster with the capabilities of both a Hadoop Compatible File System (HCFS) and Object Store (like Amazon S3).
Apache Atlas (Source: Apache Atlas) is more enterprise-focused and really shines if you're in a Hadoop-heavy environment. It supports a ton of connectors, from SQL databases to machine learning models, so if you're juggling different tools and platforms, this one can help bring everything together.
…hadoop-aws, since we almost always have interaction with S3 storage on the client side).

FROM openjdk:11-jre-slim
WORKDIR /app
# Here, we copy the common artifacts required for any of our Spark Connect
# clients (primarily spark-connect-client-jvm, as well as spark-hive,
# hadoop-aws, scala-library, etc.).
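On the client side, connecting to a Spark Connect server from Python is nearly a one-liner. The sc:// address below is a hypothetical endpoint (15002 is Spark Connect's default port), and pyspark 3.4+ with the connect extras is assumed:

from pyspark.sql import SparkSession

# Attach to a (hypothetical) Spark Connect endpoint instead of a local JVM.
spark = SparkSession.builder.remote("sc://spark-connect.example.com:15002").getOrCreate()
print(spark.range(3).count())  # Executed remotely on the Spark Connect server.
spark.stop()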