Engineering, Hadoop and SQL - Data Engineering Digest

How to get started with dbt

Christophe Blefari

MARCH 1, 2023

dbt Core is an open-source framework that helps you organise data warehouse SQL transformations. dbt was born out of the analysis that more and more companies were switching from on-premise Hadoop data infrastructure to cloud data warehouses. Jinja templating: Jinja is a templating engine that seems to have existed forever in Python.

Data Warehouse

Data Warehouse SQL Metadata Raw Data

How to learn data engineering

Christophe Blefari

JANUARY 20, 2024

Learn data engineering, all the references (credits). This is a special edition of the Data News. But right now I'm on holiday, finishing a hiking week in Corsica 🥾 So I wrote this special edition about: how to learn data engineering in 2024. The idea is to create a living reference about Data Engineering.

Data Engineering

Data Engineering Data Engineer Engineering Hadoop

Brief History of Data Engineering

Jesse Anderson

DECEMBER 12, 2022

Doug Cutting took those papers and created Apache Hadoop in 2005. They were the first companies to commercialize open source big data technologies and pushed the marketing and commercialization of Hadoop. Hadoop was hard to program, and Apache Hive came along in 2010 to add SQL. We lacked a scalable pub/sub system.

Data Engineering

Data Engineering Data Engineer Engineering Hadoop

Most Essential 2023 Interview Questions on Data Engineering

Analytics Vidhya

FEBRUARY 7, 2023

Introduction: Data engineering is the field of study that deals with the design, construction, deployment, and maintenance of data processing systems. This includes designing and implementing […] The post Most Essential 2023 Interview Questions on Data Engineering appeared first on Analytics Vidhya.

Data Engineering

Data Engineering Data Engineer Engineering Data

Engineering SQL Support on Apache Pinot at Uber

Uber Engineering

JANUARY 15, 2020

As Uber’s operations became more complex and we offered additional features and … The post Engineering SQL Support on Apache Pinot at Uber appeared first on Uber Engineering Blog.

SQL

SQL Engineering Aggregated Data Hadoop

Reflecting On The Past 6 Years Of Data Engineering

Data Engineering Podcast

FEBRUARY 5, 2023

In that time there have been a number of generational shifts in how data engineering is done. Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Look no further than Materialize, the streaming database you already know how to use.

Data Engineering

Data Engineering Data Engineer Engineering PostgreSQL

Hadoop vs Spark: Main Big Data Tools Explained

AltexSoft

JUNE 7, 2021

Hadoop and Spark are the two most popular platforms for Big Data processing. To come to the right decision, we need to divide this big question into several smaller ones, namely: What is Hadoop? scalability.

Big Data Tools

Big Data Tools Hadoop Big Data Database-centric

Simplify Your Data Architecture With The Presto Distributed SQL Engine

Data Engineering Podcast

SEPTEMBER 7, 2020

To address this problem Martin Traverso and his colleagues at Facebook built the Presto distributed query engine. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering?

Architecture

Architecture Data Architecture SQL Engineering

The Rise of the Data Engineer

Maxime Beauchemin

JANUARY 20, 2017

I joined Facebook in 2011 as a business intelligence engineer. By the time I left in 2013, I was a data engineer. We were data engineers! Data Engineering? At the same time, data engineering was the slightly younger sibling, but it was going through something similar. I wasn’t promoted or assigned to this new role.

Data Engineering

Data Engineering Data Engineer Engineering ETL Tools

Ship Smarter Not Harder With Declarative And Collaborative Data Orchestration On Dagster+

Data Engineering Podcast

MARCH 24, 2024

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free! Data lakes are notoriously complex.

Data Lake

Data Lake High Quality Data Hadoop Data Pipeline

Modern Customer Data Platform Principles

Data Engineering Podcast

JANUARY 21, 2024

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake.

Data Lake

Data Lake High Quality Data NoSQL Data Warehouse

Upgrade your Modern Data Stack

Christophe Blefari

SEPTEMBER 28, 2023

The era of Big Data was characterised by Hadoop, HDFS, distributed computing (Spark), above the JVM. We jumped from HDFS to Cloud Storage (S3, GCS) for storage and from Hadoop, Spark to Cloud warehouses (Redshift, BigQuery, Snowflake) for processing. I often say that data engineering is boring, insanely boring.

Big Data

Big Data Cloud Storage Hadoop SQL

Data Engineering Weekly #173

Data Engineering Weekly

MAY 26, 2024

[link] Meta: Composable data management at Meta. Meta writes about its transition to a composable data management system to improve interoperability, reusability, and engineering efficiency. It is refreshing to see an open stack after the Hadoop era. and why one prefers dataframe over SQL.

Data Engineering

Data Engineering Data Engineer Engineering Google Cloud

Maintain Your Data Engineers' Sanity By Embracing Automation

Data Engineering Podcast

JULY 10, 2022

Summary: Building and maintaining reliable data assets is the prime directive for data engineers. In order to make this a tractable problem it is essential that engineers embrace automation at every opportunity. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.

Data Engineering

Data Engineering Data Engineer Engineering MongoDB

Data Scientist vs Data Engineer: Differences and Why You Need Both

AltexSoft

OCTOBER 30, 2021

Was Nikola Tesla a scientist or engineer? These men didn’t stop at scientific research and ended up conceptualizing or engineering their inventions. Engineers are not only the ones bearing helmets and operating on construction sites. Data science vs data engineering. Here, data scientists are supported by data engineers.

Data Engineering

Data Engineering Data Engineer Engineering Machine Learning

Data Engineering Weekly #201

Data Engineering Weekly

DECEMBER 15, 2024

[link] Dani: Apache Iceberg: The Hadoop of the Modern Data Stack? The comment on Iceberg, a Hadoop of the modern data stack, surprises me. Iceberg has not reduced the complexity of the data stack, and all the legacy Hadoop complexity still exists on top of Apache Iceberg. However, I 100% agree with the complex stack to maintain.

Data Engineering

Data Engineering Data Engineer Engineering Hadoop

Data News — Week 24.08

Christophe Blefari

FEBRUARY 23, 2024

This week I've participated in a round table about data and did a cool presentation about Engines. The idea was to depict the history of engines over the last 40 years and what leads to polars and DuckDB. Engines evolution (me) There are 3 points that have triggered discussion about the visualisation I did. What about Arrow?

Data Lake

Data Lake PostgreSQL MongoDB MySQL

A Reflection On Learning A Lot More Than 97 Things Every Data Engineer Should Know

Data Engineering Podcast

JANUARY 30, 2022

Summary: The Data Engineering Podcast has been going for five years now and has included conversations and interviews with a huge number of guests, covering a broad range of topics. In this episode he shares some reflections on producing the podcast, compiling the book, and relevant trends in the ecosystem of data engineering.

Data Engineering

Data Engineering Data Engineer Engineering Data Pipeline

TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65

Data Engineering Podcast

JANUARY 13, 2019

Introduction: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode.

Database

Database PostgreSQL SQL MongoDB

Databricks, Snowflake and the future

Christophe Blefari

JUNE 21, 2024

One way to read data platforms: When we look at the history of platforms, what characterises their evolution is the separation (or not) between the engine and the storage. A UX where you buy a single tool combining engine and storage, where all you have to do is flow data in, write SQL, and it's done.

Metadata

Metadata Data Warehouse BI MySQL

The Good and the Bad of Hadoop Big Data Framework

AltexSoft

JULY 29, 2022

Depending on how you measure it, the answer will be 11 million newspaper pages or… just one Hadoop cluster and one tech specialist who can move 4 terabytes of textual data to a new location in 24 hours. The Hadoop toy. So the first secret to Hadoop’s success seems clear — it’s cute. What is Hadoop?

Hadoop

Hadoop Big Data Google Cloud NoSQL

Best Morgan Stanley Data Engineer Interview Questions

U-Next

MARCH 1, 2023

Introduction: A Data Engineer is responsible for managing the flow of data to be used to make better business decisions. A solid understanding of relational databases and the SQL language is a must-have skill, as is the ability to manipulate large amounts of data effectively. In 2022, data engineering will hold a share of 29.8%

Data Engineering

Data Engineering Data Engineer Non-relational Database Engineering

Apache Spark vs MapReduce: A Detailed Comparison

Knowledge Hut

MAY 2, 2024

But over the last 5 years or so, with Apache Spark gaining more ground, demand for MapReduce as the processing engine has reduced. Pig has SQL-like syntax, which makes it easier for SQL developers to get on board. Compatibility: MapReduce is also compatible with all data sources and file formats Hadoop supports.

Hadoop

Hadoop Scala Datasets Java

Inside Agoda’s Private Cloud - Exclusive

The Pragmatic Engineer

JUNE 13, 2023

👋 Hi, this is Gergely with the monthly, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover challenges at Big Tech and startups through the lens of engineering managers and senior engineers. The majority of the engineering team is in Bangkok, Thailand. It uses Spark for the data platform.

Cloud

Cloud Database Utilities BI

Top 8 Hadoop Projects to Work in 2024

Knowledge Hut

DECEMBER 28, 2023

That's where Hadoop comes into the picture. Hadoop is a popular open-source framework that stores and processes large datasets in a distributed manner. Organizations are increasingly interested in Hadoop to gain insights and a competitive advantage from their massive datasets. Why Are Hadoop Projects So Important?

Hadoop

Hadoop Project Datasets Big Data

Performing Fast Data Analytics Using Apache Kudu - Episode 64

Data Engineering Podcast

JANUARY 6, 2019

Summary: The Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. For a perfect pairing, they made it easy to connect to the Impala SQL engine. How does it fit into the Hadoop ecosystem? Can you start by explaining what Kudu is and the motivation for building it?

Data Analytics

Data Analytics Hadoop Kafka Media

Data Engineers of Netflix - Interview with Kevin Wylie

Netflix Tech

JULY 15, 2021

Data Engineers of Netflix - Interview with Kevin Wylie. This post is part of our “Data Engineers of Netflix” series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix. Kevin Wylie is a Data Engineer on the Content Data Science and Engineering team.

Data Engineering

Data Engineering Data Engineer Engineering Entertainment

Big Data Technologies that Everyone Should Know in 2024

Knowledge Hut

APRIL 25, 2024

If you pursue the MSc big data technologies course, you will be able to specialize in topics such as Big Data Analytics, Business Analytics, Machine Learning, Hadoop and Spark technologies, Cloud Systems etc. There are a variety of big data processing technologies available, including Apache Hadoop, Apache Spark, and MongoDB.

Big Data

Big Data Technology Hadoop NoSQL

PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32

Data Engineering Podcast

MAY 20, 2018

Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. For someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries?

PostgreSQL

PostgreSQL Hadoop SQL Kafka

Ripple's Data Evolution: Leveraging Databricks for Next-Gen XRP Ledger Analytics

Ripple Engineering

JULY 9, 2024

We recently embarked on a significant data platform migration, transitioning from Hadoop to Databricks, a move motivated by our relentless pursuit of excellence and our contributions to the XRP Ledger's (XRPL) data analytics. High maintenance costs and a system that struggled to meet the real-time demands of our data-driven initiatives.

Hadoop

Hadoop Data Lake Machine Learning Raw Data

Data News — Week 24.40

Christophe Blefari

OCTOBER 6, 2024

Still I hope this edition finds you well, it will be a mix of personal news, OpenAI saga and usual data engineering stuff that I enjoy reading. I really like the program we put in place, it is a mix of Engineering and Strategic / Vision talks. Current state of Databricks SQL: "The best data warehouse is a lakehouse", lmao.

Data

Data SQL Hadoop Deep Learning

Fundamentals of Apache Spark

Knowledge Hut

MAY 3, 2024

Spark offers over 80 high-level operators that make it easy to build parallel apps and one can use it interactively from the Scala, Python, R, and SQL shells. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. Basic knowledge of SQL. Yarn etc) Or, 2.

Hadoop

Hadoop Scala Healthcare Big Data

Charting A Path For Streaming Data To Fill Your Data Lake With Hudi

Data Engineering Podcast

AUGUST 3, 2021

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Go to dataengineeringpodcast.com/97things today to get your copy!

Data Lake

Data Lake Data Warehouse Hadoop Architecture

A High Performance Platform For The Full Big Data Lifecycle

Data Engineering Podcast

AUGUST 19, 2019

One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. This was an interesting view of how a well engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.

Big Data

Big Data Hadoop Data Lake Media

Data News — Week 23.14

Christophe Blefari

APRIL 8, 2023

I was in the Hadoop world and all I was doing was denormalisation. The only normalisation I did was back at the engineering school while learning SQL with Normal Forms. I still firmly believe that this is not the role of a data engineer. Data modeling should not be a required data engineer skill. YAML configured.

Pipeline-centric

Pipeline-centric Database-centric Algorithm Data

How to install Apache Spark on Windows?

Knowledge Hut

MAY 2, 2024

It provides high-level APIs in Java, Scala, Python, and R and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. For Hadoop 2.7,

Java

Java Hadoop Scala SQL

Real-Time Analytics and Monitoring Dashboards with Apache Kafka and Rockset

Confluent

SEPTEMBER 26, 2019

In the early days, many companies simply used Apache Kafka® for data ingestion into Hadoop or another data lake. Rockset supports JDBC and integrates with other SQL dashboards like Tableau, Grafana, and Apache Superset. However, Apache Kafka is more than just messaging. In the most critical use cases, every second counts.

Kafka

Kafka SQL BI Hadoop

Taking A Tour Of The Google Cloud Platform For Data And Analytics

Data Engineering Podcast

JUNE 11, 2021

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. No more scripts, just SQL.

Google Cloud

Google Cloud Cloud Big Data Ecosystem BI

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

Data Engineering Podcast

FEBRUARY 11, 2018

Preamble: Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Is timescale compatible with systems such as Amazon RDS or Google Cloud SQL? What impact has the 10.0

PostgreSQL

PostgreSQL NoSQL Google Cloud MongoDB

Data Engineering Weekly #154

Data Engineering Weekly

DECEMBER 24, 2023

I love the rising, stable, and declining format for categorizing data engineering trends. However, I’m less optimistic about the “multi-engine” orchestrator part. [link] DoorDash: Privacy Engineering at DoorDash Drive. DoorDash writes about its architecture and policy to protect user privacy at scale.

Data Engineering

Data Engineering Data Engineer Engineering Deep Learning

The Week of Data Conference Extravaganza: Databricks, Snowflake, LLM and the Future of Data Engineering

Data Engineering Weekly

JUNE 29, 2023

But hey, I met my friends after a long time and got my copy of “Fundamentals of Data Engineering” signed by Joe Reis & Matt Housley. If you’re starting data engineering, I highly recommend reading it. Snowflake is a DataLake Platform: Snowflake is moving beyond a SQL data warehouse.

Data Engineering

Data Engineering Data Engineer Google Cloud Engineering

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Data Engineering Podcast

DECEMBER 9, 2018

Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book for data engineers to hit the ground running. What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?

MySQL

MySQL Scala Kafka Hadoop

Top 8 Interview Questions on Apache Sqoop
