Introduction In this constantly growing technical era, big data is at its peak, and with it comes the need for a tool to import and export data between RDBMS and Hadoop. Apache Sqoop, short for “SQL to Hadoop,” is one such tool: it transfers data between Hadoop (Hive, HBase, HDFS, etc.) and relational databases.
For organizations considering moving from a legacy data warehouse to Snowflake, looking to learn more about how the AI Data Cloud can support legacy Hadoop use cases, or assessing new options when a current cloud data warehouse just isn’t scaling anymore, it helps to see how others have done it.
Hadoop and Spark are the two most popular platforms for Big Data processing. To come to the right decision, we need to divide this big question into several smaller ones — namely: What is Hadoop? What is Spark? And how do the two compare on criteria such as scalability?
dbt Core is an open-source framework that helps you organise data warehouse SQL transformations. dbt was born out of the observation that more and more companies were switching from on-premise Hadoop data infrastructure to cloud data warehouses. tests — a way to define SQL tests either at the column level or with a standalone query.
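As a quick illustration of the query-style tests mentioned above: a dbt “singular” test is just a SQL file that returns the rows violating an assumption, and the test passes when zero rows come back. The model and column names here (orders, amount) are hypothetical, not from the article.

```sql
-- Hypothetical singular dbt test: the test fails if this query returns any rows.
SELECT
    order_id,
    amount
FROM {{ ref('orders') }}
WHERE amount < 0  -- an order amount should never be negative
```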
How has the market for time series databases changed since we last spoke? Can you refresh our memory about what TimescaleDB is? What has changed in the focus and features of the TimescaleDB project and company? Toward the end of 2018 you launched the 1.0 release.
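For context on “what TimescaleDB is”: it ships as a PostgreSQL extension, so a time series table is ordinary SQL plus one function call. A minimal sketch, with a hypothetical table and columns:

```sql
-- Minimal sketch; assumes the timescaledb extension is available in PostgreSQL.
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE metrics (
    ts     TIMESTAMPTZ NOT NULL,  -- time column used for partitioning
    device TEXT        NOT NULL,
    value  DOUBLE PRECISION
);

-- Convert the plain table into a time-partitioned hypertable.
SELECT create_hypertable('metrics', 'ts');
```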
Two of the more painful things in your everyday life as an analyst or SQL worker are not getting easy access to data when you need it, and not having easy-to-use, useful tools that don’t get in your way! This simple statement captures the essence of almost 10 years of SQL development with modern data warehousing.
As Uber’s operations became more complex and we offered additional features and … The post Engineering SQL Support on Apache Pinot at Uber appeared first on Uber Engineering Blog.
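To make the “SQL support on Pinot” idea concrete: Pinot queries are plain SQL aggregations over event tables. The table and columns below (trips, city_id, fare_usd) are made up for illustration, not taken from Uber’s post.

```sql
-- Hypothetical Pinot query: interactive aggregation over a 'trips' event table.
SELECT city_id,
       COUNT(*)      AS trip_count,
       AVG(fare_usd) AS avg_fare
FROM trips
WHERE ts >= 1700000000000  -- epoch-millis cutoff (illustrative value)
GROUP BY city_id
ORDER BY trip_count DESC
LIMIT 10
```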
It supports a ton of connectors, from SQL databases to machine learning models, so if you’re juggling different tools and platforms, this one can help bring everything together. Apache Atlas is more enterprise-focused and really shines if you’re in a Hadoop-heavy environment. It’s simple, but it works.
Managing the operational concerns for your database can be complex and expensive, especially if you need to scale to large volumes of data, high traffic, or geographically distributed usage. No more shipping and praying: you can now know exactly what will change in your database, or any other destination you choose!
Typically this means downloading files from object storage, or querying a database. To speed up the process, why not build the model inside the database so that you don’t have to move the information? Can you start by giving an overview of the current state of the market for databases that support in-process machine learning?
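One well-known incarnation of this in-database ML idea is BigQuery ML, where a model is trained and scored with SQL so the data never leaves the warehouse. This is a hedged sketch, not the episode’s method; the dataset, table, and column names are hypothetical.

```sql
-- BigQuery ML sketch: train a logistic regression inside the warehouse.
CREATE OR REPLACE MODEL mydataset.churn_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT
    tenure_months,
    monthly_spend,
    support_tickets,
    churned            -- label column
FROM mydataset.customers;

-- Score new rows in place, without exporting data.
SELECT *
FROM ML.PREDICT(MODEL mydataset.churn_model,
                (SELECT tenure_months, monthly_spend, support_tickets
                 FROM mydataset.new_customers));
```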
Database object security. Database object-level security is available through the centralized authorization framework of Apache Ranger. Both fine-grained access control of database objects and access to metadata are provided. Protected database objects include: databases, tables, columns, views, and user-defined functions (UDFs).
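With the Ranger plugin enabled for Hive, fine-grained grants can be expressed in familiar SQL and enforced as Ranger policies. A minimal sketch, assuming hypothetical role and object names:

```sql
-- Hypothetical fine-grained grants enforced through Ranger's Hive plugin.
CREATE ROLE analysts;
GRANT SELECT ON DATABASE sales TO ROLE analysts;        -- database-level access
GRANT SELECT ON TABLE sales.orders TO ROLE analysts;    -- table-level access
REVOKE ALL ON TABLE sales.salaries FROM ROLE analysts;  -- withhold sensitive data
```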
Ten years ago, this was a 300GB Hadoop cluster; the data stored has since grown around 100,000-fold! For transactional databases, it’s mostly Microsoft SQL Server, but also other databases like PostgreSQL, ScyllaDB, and Couchbase. It uses Spark for the data platform.
Summary Databases are limited in scope to the information that they directly contain, and combining other sources frequently requires cumbersome and time-consuming data integration. If you need to work with data in your cloud data lake, your on-premise database, or a collection of flat files, then give this episode a listen and then try out Presto today.
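The pitch in one query: Presto (and its successor Trino) addresses every source as catalog.schema.table, so a single SQL statement can join the data lake with an operational database. The catalogs and tables below are hypothetical.

```sql
-- Hypothetical federated query: Parquet files in a data lake (hive catalog)
-- joined with a live PostgreSQL table, in one Presto/Trino statement.
SELECT u.country,
       COUNT(*) AS page_views
FROM hive.web.page_view_events AS e      -- data lake / flat files
JOIN postgresql.public.users   AS u      -- on-premise operational database
  ON e.user_id = u.id
GROUP BY u.country
ORDER BY page_views DESC;
```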
Introduction. Prior to the introduction of CDP Public Cloud, many organizations that wanted to leverage CDH, HDP, or any other on-prem Hadoop runtime in the public cloud had to deploy the platform in a lift-and-shift fashion, commonly known as “Hadoop-on-IaaS” or simply the IaaS model. SQL-driven Streaming App Development.
Look no further than Materialize, the streaming database you already know how to use. Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support.
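“PostgreSQL-compatible” means the streaming pipeline is declared as ordinary SQL and kept incrementally up to date. A hedged sketch, assuming a Kafka connection named kafka_conn and hypothetical source/view names:

```sql
-- Hypothetical Materialize pipeline over a Kafka topic.
CREATE SOURCE purchases
  FROM KAFKA CONNECTION kafka_conn (TOPIC 'purchases')
  FORMAT JSON;

-- The view is maintained incrementally as new events arrive.
CREATE MATERIALIZED VIEW revenue_by_region AS
SELECT region, SUM(amount) AS revenue
FROM purchases
GROUP BY region;

-- Reads are plain Postgres-style SELECTs reflecting the latest stream state.
SELECT * FROM revenue_by_region ORDER BY revenue DESC;
```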
Summary Databases and analytics architectures have gone through several generational shifts. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics.
A streaming ETL for Snowflake approach loads data to Snowflake from diverse sources such as transactional databases, security systems logs, and IoT sensors/devices in real time, while simultaneously meeting scalability, latency, security, and reliability requirements.
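A minimal sketch of the loading side, using Snowflake’s Snowpipe to continuously copy files as they land in a stage; the pipe, stage, and table names are hypothetical:

```sql
-- Hypothetical continuous-load setup: Snowpipe ingests files as they arrive.
CREATE OR REPLACE PIPE raw.events_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw.events
  FROM @raw.events_stage
  FILE_FORMAT = (TYPE = 'JSON');
```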
System Requirements: Support for Structured Data. The growth of NoSQL databases has broadly been accompanied by the trend of data “schemalessness.” We have chosen the high-capacity, high-performance Cassandra (C*) database as the backend implementation that serves as the source of truth for all our data.
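Despite the NoSQL label, Cassandra’s CQL is deliberately SQL-like but requires the schema, including partition and clustering keys, up front. The keyspace and table below are hypothetical:

```sql
-- Hypothetical CQL table: the PRIMARY KEY declares a partition key (device_id)
-- and a clustering column (ts), which together determine the data layout.
CREATE TABLE telemetry.events (
    device_id uuid,
    ts        timestamp,
    payload   text,
    PRIMARY KEY ((device_id), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
```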
Good old data warehouses like Oracle were engine + storage; then Hadoop arrived and was almost the same: you had an engine (MapReduce, Pig, Hive, Spark) and HDFS, everything in the same cluster, with data co-location. You could write the same pipeline in Java, in Scala, in Python, in SQL, etc.
Hadoop initially led the way with Big Data and distributed computing on-premise, to finally land on the Modern Data Stack — in the cloud — with a data warehouse at the center. In order to understand today's data engineering, I think it is important to at least know Hadoop concepts and context, and computer science basics.
This means many manually implemented Ranger HDFS policies, Hadoop ACLs, or POSIX permissions created solely for this purpose can now be removed, if desired. Instead, it generates a mapping that allows the Ranger plugin in HDFS to make run-time decisions based on the Hadoop SQL grants.
If you pursue an MSc in big data technologies, you will be able to specialize in topics such as Big Data Analytics, Business Analytics, Machine Learning, Hadoop and Spark technologies, Cloud Systems, etc. There are a variety of big data processing technologies available, including Apache Hadoop, Apache Spark, and MongoDB.
This discipline also integrates specialization around the operation of so-called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and computation at scale. This includes tasks like setting up and operating platforms like Hadoop/Hive/HBase, Spark, and the like.
Evolution of Open Table Formats. Here’s a timeline that outlines the key moments in the evolution of open table formats. 2008 - Apache Hive and Hive Table Format: Facebook introduced Apache Hive as one of the first table formats as part of its data warehousing infrastructure, built on top of Hadoop.
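The original Hive table format tied a schema and a directory layout together in HiveQL: a table is a directory of files, and each partition maps to a subdirectory. The names and path below are illustrative:

```sql
-- Illustrative Hive DDL: each partition becomes a subdirectory
-- such as /warehouse/page_views/dt=2008-01-01/.
CREATE EXTERNAL TABLE page_views (
    user_id  BIGINT,
    url      STRING,
    referrer STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION '/warehouse/page_views';
```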
The foundational skills of traditional data engineers and AI data engineers are similar, with AI data engineers more heavily focused on machine learning data infrastructure, AI-specific tools, vector databases, and LLM pipelines. Let’s dive into the tools necessary to become an AI data engineer.
News on Hadoop - February 2018 Kyvos Insights to Host Webinar on Accelerating Business Intelligence with Native Hadoop BI Platforms. The leading big data analytics company Kyvos Insights is hosting a webinar titled “Accelerate Business Intelligence with Native Hadoop BI platforms.” PRNewswire.com, February 1, 2018.
The tools and techniques are proven, the SQL query language is well known, and there’s plenty of expertise available to keep EDWs humming. Enter Hadoop, which lets you store data on a massive scale at low cost (compared with similarly scaled commercial databases).
This blog post provides CDH users with a quick overview of Ranger as a Sentry replacement for Hadoop SQL policies in CDP. Apache Sentry is a role-based authorization module for specific components in Hadoop. It is useful in defining and enforcing different levels of privileges on data for users of a Hadoop cluster.
The landscape of time series databases is extensive and oftentimes difficult to navigate. Is Timescale compatible with systems such as Amazon RDS or Google Cloud SQL? What impact has the 10.0 release had?
Most Popular Programming Certifications: C & C++ Certifications; Oracle Certified Associate Java Programmer (OCAJP); Certified Associate in Python Programming (PCAP); MongoDB Certified Developer Associate Exam; R Programming Certification; Oracle MySQL Database Administration Training and Certification (CMDBA); CCA Spark and Hadoop Developer.
To establish a career in big data, you need to be knowledgeable about some concepts, Hadoop being one of them. Hadoop tools are frameworks that help to process massive amounts of data and perform computation. You can learn in detail about Hadoop tools and technologies through a Big Data and Hadoop training online course.
This job requires a handful of skills, starting from a strong foundation of SQL and programming languages like Python, Java, etc. Data Engineers are skilled professionals who lay the foundation of databases and architecture. Data engineers who focus on databases work with data warehouses and develop different table schemas.
Mastodon and Hadoop are on a boat. Introduction to Snowflake's Micro-Partitions — I think that explanations about database internals are my favourite tech articles. The SaaS app connects to your warehouse and displays your data in a tabular format after a query (graphically built or SQL). Which, yeah, kinda sucks.
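For the one-line takeaway on that internals article: Snowflake slices every table into immutable micro-partitions and prunes them at query time using per-partition min/max metadata; you can influence the layout with a clustering key. The table and column here are hypothetical:

```sql
-- Hypothetical example: a clustering key co-locates related rows in
-- micro-partitions, improving pruning for date-filtered queries.
ALTER TABLE events CLUSTER BY (event_date);

-- This query can then skip micro-partitions whose min/max event_date
-- range cannot match the filter.
SELECT COUNT(*) FROM events WHERE event_date = '2023-01-01';
```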
That's where Hadoop comes into the picture. Hadoop is a popular open-source framework that stores and processes large datasets in a distributed manner. Organizations are increasingly interested in Hadoop to gain insights and a competitive advantage from their massive datasets. Why Are Hadoop Projects So Important?
At the heart of these data engineering skills lies SQL that helps data engineers manage and manipulate large amounts of data. Did you know SQL is the top skill listed in 73.4% Almost all major tech organizations use SQL. According to the 2022 developer survey by Stack Overflow , Python is surpassed by SQL in popularity.
To store and process even only a fraction of this amount of data, we need Big Data frameworks: traditional databases would not be able to store so much data, nor would traditional processing systems be able to process it quickly. Pig has SQL-like syntax, so it is easy for SQL developers to get on board.
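To illustrate the “SQL-like” claim, here is a toy aggregation in SQL with the rough Pig Latin equivalent shown in comments; the dataset and fields are made up.

```sql
-- SQL version of a toy aggregation (hypothetical table and columns):
SELECT user_id, SUM(amount) AS total
FROM purchases
GROUP BY user_id;

-- Rough Pig Latin equivalent, step by step:
--   A = LOAD 'purchases' USING PigStorage(',') AS (user_id:chararray, amount:int);
--   B = GROUP A BY user_id;
--   C = FOREACH B GENERATE group AS user_id, SUM(A.amount) AS total;
--   DUMP C;
```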
One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Despite being older than the Hadoop platform, it doesn’t seem that HPCC Systems has seen the same level of growth and popularity.
ACID transactions, ANSI 2016 SQL support, major performance improvements. This customer’s workloads leverage batch processing of data from 100+ backend database sources like Oracle, SQL Server, and traditional mainframes using Syncsort. Document the operating system versions, database versions, and JDK versions.
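Concretely, Hive 3’s ACID support means a managed ORC table accepts row-level DML, which pre-ACID Hive could not do. A minimal sketch with a hypothetical table:

```sql
-- Hypothetical Hive 3 ACID table: full transactional semantics require ORC.
CREATE TABLE customer_contacts (
    id    BIGINT,
    email STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level DML under ACID:
UPDATE customer_contacts SET email = 'new@example.com' WHERE id = 42;
DELETE FROM customer_contacts WHERE id = 43;
```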
I was in the Hadoop world, and all I was doing was denormalisation. The only normalisation I did was back at engineering school while learning SQL with Normal Forms. Under the hood it uses sqlglot, the SQL parser that has been developed by the same developer. Denormalisation everywhere. YAML configured. Roboto AI raises $4.8m
ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. Where does it fit in the database market and how does it compare to other column stores, both open source and commercial? What are some of the advanced capabilities, such as SQL extensions, supported data types, etc.
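A minimal sketch of what “column-oriented, built for interactive analytics” looks like in practice; the table and columns are hypothetical:

```sql
-- Hypothetical ClickHouse table: MergeTree stores each column separately and
-- keeps data sorted by the ORDER BY key, which drives fast range scans.
CREATE TABLE hits (
    event_date Date,
    user_id    UInt64,
    url        String,
    latency_ms UInt32
)
ENGINE = MergeTree
ORDER BY (event_date, user_id);

-- Typical interactive aggregation: only the referenced columns are read.
SELECT url, COUNT(*) AS views, quantile(0.95)(latency_ms) AS p95
FROM hits
WHERE event_date >= today() - 7
GROUP BY url
ORDER BY views DESC
LIMIT 20;
```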
In the early days, many companies simply used Apache Kafka® for data ingestion into Hadoop or another data lake. Rockset supports JDBC and integrates with other SQL dashboards like Tableau, Grafana, and Apache Superset. What if mainframes, databases, logs, or sensor data are involved in your use case?
Hadoop: Gigabytes to petabytes of data may be stored and processed effectively using the open-source framework known as Apache Hadoop. Hadoop enables the clustering of many computers to examine big datasets in parallel, faster than a single powerful machine could manage for data storage and processing. Packages and Software: OpenCV.