Save money, save money!! Hear, hear! Someone on LinkedIn recently brought up the point that companies could save gobs of money by swapping out AWS Python lambdas for Rust ones. While it raised the ire of many a Python Data Engineer, I thought it sounded like a great idea. At least it’s an excuse to […] The post AWS Lambdas – Python vs Rust.
Introduction Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform built on top of the Microsoft Azure cloud. Its collaborative, interactive workspace allows users to perform big data processing and machine learning tasks with ease. In this blog post, we will take a closer look at Azure Databricks, its key features, […] The post Azure Databricks: A Comprehensive Guide appeared first on Analytics Vidhya.
As I sit down to write this article, I’m filled with a sense of vulnerability and excitement. You see, this is a story that only I can tell. It’s a tale of finding my Pathless Path and discovering who I am in the process. I have learned that some of my best decision-making comes from following my gut, heart, and intuition, a place of inner knowing.
This article is meant to be a resource hub for understanding dbt basics and to help you get started on your dbt journey. When I write dbt, I often mean dbt Core. dbt Core is an open-source framework that helps you organise data warehouse SQL transformations. dbt Core is developed by dbt Labs, which was previously named Fishtown Analytics; the company was founded in May 2016. dbt Labs also develops dbt Cloud, a cloud product that hosts and runs dbt Core projects.
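Beyond the CLI, dbt Core can also be driven from Python. Here is a minimal, hedged sketch assuming dbt Core 1.5+ (which ships the programmatic dbtRunner entry point); the project directory and selector are hypothetical.

```python
# A minimal sketch of invoking dbt Core from Python, assuming dbt Core 1.5+.
# The project path and "staging" selector are hypothetical examples.
from dbt.cli.main import dbtRunner, dbtRunnerResult

runner = dbtRunner()

# Equivalent to `dbt run --select staging` on the command line.
result: dbtRunnerResult = runner.invoke(
    ["run", "--select", "staging", "--project-dir", "./my_dbt_project"]
)

if result.success:
    print("dbt run completed")
else:
    print("dbt run failed:", result.exception)
```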
In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples for debugging Airflow DAGs. You’ll learn how to: create a standardized process for debugging to quickly diagnose errors in your DAGs; identify common issues with DAGs, tasks, and connections; and distinguish between Airflow-related and DAG-related issues.
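One debugging technique worth knowing up front is running a DAG entirely in-process. A minimal sketch, assuming Airflow 2.5+ (which added dag.test()); the DAG and task names are hypothetical:

```python
# Debugging a DAG locally with dag.test(), available in Airflow 2.5+.
# It runs the whole DAG in one process, so breakpoints and print() work
# normally instead of digging through scheduler logs.
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
def example_debug_dag():
    @task
    def extract():
        return {"rows": 42}

    @task
    def load(payload: dict):
        if payload["rows"] == 0:
            raise ValueError("no rows extracted")
        print(f"loaded {payload['rows']} rows")

    load(extract())


dag_object = example_debug_dag()

if __name__ == "__main__":
    # Executes all tasks in-process, in dependency order.
    dag_object.test()
```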
👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of five topics from The Scoop #39, published two weeks ago, on 23 February. To get full newsletters twice a week, subscribe here. I have collaborated with a tech recruiter - they’ve asked to remain anonymous - who’s been running some very interesting queries on LinkedIn for software engineers.
Introduction Data science has taken over all economic sectors in recent times. To achieve maximum efficiency, every company strives to use data at every stage of its operations. Each aspect of data science, like data preparation, the importance of big data, and the process of automation, contributes to how data science is the future […] The post 30 Best Data Science Books to Read in 2023 appeared first on Analytics Vidhya.
The latest KDnuggets cheat sheet covers using ChatGPT to your advantage as a data scientist. It's time to master prompt engineering, and here is a handy reference for helping you along the way.
Summary There has been a lot of discussion about the practical application of data mesh and how to implement it in an organization. Jean-Georges Perrin was tasked with designing a new data platform implementation at PayPal and wound up building a data mesh. In this episode he shares that journey and the combination of technical and organizational challenges that he encountered in the process.
👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of five topics in yesterday’s subscriber-only The Scoop issue. To get full newsletters twice a week, subscribe here. On 22 February 2023, Google announced that its coding competitions are coming to an end. [Image: the visual that accompanied the announcement of the end of Google’s coding competitions.]
Introduction Data is the new oil of this century, and the database is a major element of any data science project. To generate actionable insights, the database must be centralized and organized efficiently. If a corrupted, unorganized, or redundant database is used, the results of the analysis may become inconsistent and highly misleading. So, we are […] The post How to Normalize Relational Databases With SQL Code?
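To make the redundancy problem concrete, here is a small illustrative sketch using Python’s built-in sqlite3 module. The schema is hypothetical: a denormalized orders table repeats customer data on every row, and splitting customers into their own table removes that redundancy (roughly, a move toward third normal form).

```python
# Illustrating normalization with sqlite3; all table names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized: customer_name/customer_email repeat for every order,
    -- so a customer's email must be updated in many rows at once.
    CREATE TABLE orders_denormalized (
        order_id INTEGER PRIMARY KEY,
        customer_name TEXT,
        customer_email TEXT,
        amount REAL
    );

    -- Normalized: customer attributes live in exactly one place.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name TEXT,
        email TEXT UNIQUE
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL
    );
""")
conn.close()
```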
Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.
Every database built for real-time analytics has a fundamental limitation. When you deconstruct the core database architecture, deep in its heart you will find a single component performing two distinct, competing functions: real-time data ingestion and query serving. Running these two parts on the same compute unit is what makes the database real-time: queries can reflect the effect of data that was ingested just moments ago.
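A toy sketch of that coupling, not modeled on any particular database: one component owns both ingestion and query serving, so a shared lock forces them to compete for the same compute while also guaranteeing queries see fresh data.

```python
# Illustrative only: ingest and query share one lock, so they contend,
# but every query immediately sees the latest ingested rows.
import threading


class RealtimeStore:
    def __init__(self):
        self._rows = []
        self._lock = threading.Lock()  # ingestion and serving contend here

    def ingest(self, row: dict) -> None:
        with self._lock:
            self._rows.append(row)  # visible to the very next query

    def query(self, predicate) -> list:
        with self._lock:  # serving blocks while an ingest holds the lock
            return [r for r in self._rows if predicate(r)]


store = RealtimeStore()
store.ingest({"user": "a", "clicks": 3})
print(store.query(lambda r: r["clicks"] > 1))  # reflects the fresh ingest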
Snowflake enables organizations to be data-driven by offering an expansive set of features for creating performant, scalable, and reliable data pipelines that feed dashboards, machine learning models, and applications. But before data can be transformed and served or shared, it must be ingested from source systems. The volume of data generated in real time from application databases, sensors, and mobile devices continues to grow exponentially.
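As a rough sketch of the ingestion step described above, here is a hedged batch-load example using the snowflake-connector-python package; the account, stage, and table names are hypothetical, and COPY INTO loads files that have already been uploaded to a stage.

```python
# Hypothetical batch ingestion into Snowflake via the Python connector.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # hypothetical account identifier
    user="loader",
    password="***",
    warehouse="LOAD_WH",
    database="RAW",
    schema="EVENTS",
)
try:
    cur = conn.cursor()
    # PUT uploads a local file to an internal stage.
    cur.execute("PUT file://events.json @events_stage AUTO_COMPRESS=TRUE")
    # COPY INTO performs the bulk load from the stage into the table.
    cur.execute("""
        COPY INTO raw.events.clickstream
        FROM @events_stage
        FILE_FORMAT = (TYPE = 'JSON')
    """)
finally:
    conn.close()
```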
Introduction The Hadoop Distributed File System (HDFS) is a Java-based file system that is distributed, scalable, and portable. Because it does not fully conform to POSIX, some consider it a data store rather than a file system. Still, it includes shell commands and Java Application Programming Interface (API) functions similar to those of other file systems. HDFS and […] The post Top 10 Hadoop Interview Questions You Must Know appeared first on Analytics Vidhya.
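From Python, those file-system-like calls can be reached through pyarrow, which wraps the Hadoop client libraries. A hedged sketch, assuming libhdfs is available on the machine; host, port, and paths are hypothetical.

```python
# Reading and writing HDFS via pyarrow; the calls mirror the familiar
# shell commands (hdfs dfs -put / -ls / -cat).
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# Write a small file, then list the directory and read the file back.
with hdfs.open_output_stream("/tmp/hello.txt") as out:
    out.write(b"hello hdfs\n")

print(hdfs.get_file_info(fs.FileSelector("/tmp")))

with hdfs.open_input_stream("/tmp/hello.txt") as f:
    print(f.read())
```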
Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage
There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.
Data can have various quality issues, from missing to badly formatted values. However, there is another issue fewer people talk about: erroneous filtering logic.
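A small pandas example of the pitfall: rows with missing values can silently fall out of a filter, because comparisons against NaN evaluate to False. The data here is made up.

```python
# Erroneous filtering: a NaN row vanishes without any warning.
import numpy as np
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [50.0, np.nan, 200.0]})

# Intent: keep orders of 100 or less. NaN <= 100 is False, so the row
# with a missing amount is silently excluded.
small = df[df["amount"] <= 100]
print(len(small))  # 1, not 2

# Safer: decide explicitly what missing values should do.
small_or_unknown = df[(df["amount"] <= 100) | df["amount"].isna()]
print(len(small_or_unknown))  # 2
```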
Introduction One of the most important assets of any organization is the data it produces on a daily basis. Organizations use this data to find valuable insights that help improve their growth and strategies and give them an upper hand over competitors. This article explains to you the idea […] The post Understanding Dimensional Modeling appeared first on Analytics Vidhya.
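For a quick feel of what dimensional modeling produces, here is an illustrative star schema in SQLite: one fact table of measures surrounded by dimension tables that describe them. All table and column names are hypothetical.

```python
# A hypothetical star schema: fact_sales plus two dimensions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,   -- e.g. 20230301
        full_date TEXT, month TEXT, year INTEGER
    );
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name TEXT, category TEXT
    );
    -- The fact table holds measures plus foreign keys to each dimension.
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER,
        revenue REAL
    );
""")
conn.close()
```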
Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives
Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-driven decisions.
Ray is a prominent compute framework for running scalable AI and Python workloads, offering a variety of distributed machine learning tools, including large-scale hyperparameter tuning.
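A minimal sketch of Ray’s core primitive: decorate a function with @ray.remote and it runs as a distributed task returning a future. This runs locally without a cluster (pip install ray).

```python
# Parallel tasks with Ray; ray.get blocks until results are ready.
import ray

ray.init()  # starts a local Ray runtime if no cluster address is given


@ray.remote
def square(x: int) -> int:
    return x * x


futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```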
Introduction Cassandra is a free and open-source distributed NoSQL database management system developed by Apache. It manages huge volumes of data across many commodity servers, ensures fault tolerance with swift data transfer, and provides high availability with no single point of failure. Written in Java, Apache Cassandra is highly scalable for Big Data models and comprises flexible […] The post Top 5 Interview Questions on Cassandra appeared first on Analytics Vidhya.
Recently, we announced enhanced multi-function analytics support in Cloudera Data Platform (CDP) with Apache Iceberg. Iceberg is a high-performance open table format for huge analytic data sets. It allows multiple data processing engines, such as Flink, NiFi, Spark, Hive, and Impala, to access and analyze data in simple, familiar SQL tables. In this blog post, we are going to share with you how Cloudera Stream Processing (CSP) is integrated with Apache Iceberg and how you can use SQL Stream Builder.
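The post demonstrates SQL Stream Builder; as a rough stand-in, this hedged PyFlink sketch shows the same idea in code: registering an Iceberg catalog and running a streaming insert with Flink SQL. The connector options follow the Apache Iceberg Flink documentation, the URIs and table names are hypothetical, and the iceberg-flink runtime jar is assumed to be on the classpath.

```python
# Writing to Iceberg from Flink SQL via PyFlink; names are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

env.execute_sql("""
    CREATE CATALOG iceberg_cat WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hive',
        'uri' = 'thrift://metastore.example.com:9083',
        'warehouse' = 'hdfs://namenode:8020/warehouse'
    )
""")

# Continuously copy events from a previously registered source table
# (hypothetical) into the Iceberg table.
env.execute_sql("""
    INSERT INTO iceberg_cat.db.events_iceberg
    SELECT * FROM source_events
""")
```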
With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG-writing features with plenty of example code. You’ll learn how to: understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to; write DAGs that adapt to your data at runtime and set up alerts and notifications; and scale your DAGs.
Spark Structured Streaming is the widely used open source engine at the foundation of data streaming on the Databricks Lakehouse Platform. It can elegantly […]
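A minimal PySpark Structured Streaming sketch: read a stream, aggregate it, and write results out continuously. The built-in rate source generates synthetic rows, so this runs without any external system.

```python
# Windowed counts over a synthetic stream; state is managed by the engine.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

query = (
    counts.writeStream.outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```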
Introduction Structured Query Language is a powerful language to manage and manipulate data stored in databases. SQL is widely used in the field of data science and is considered an essential skill to have if you work with data. After being introduced in the 70s, it has become the standard querying language for relational databases. […] The post Step-by-Step Roadmap to Learn SQL in 2023 appeared first on Analytics Vidhya.
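For a first taste of the skills such a roadmap covers, here is a tiny runnable example using Python’s built-in sqlite3 module; the table and data are made up.

```python
# Filtering, grouping, and aggregation -- the bread and butter of SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", "data", 95000), ("Ben", "data", 88000), ("Cam", "ops", 70000)],
)

for row in conn.execute("""
    SELECT dept, COUNT(*) AS headcount, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept
    ORDER BY avg_salary DESC
"""):
    print(row)
conn.close()
```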
In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!
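As a taste of one feature the webinar covers, here is a hedged sketch of dynamic task mapping (available since Airflow 2.3): expand() fans a task out over inputs known only at runtime. The DAG and task names are hypothetical.

```python
# One mapped task instance is created per file returned at runtime.
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
def mapped_dag():
    @task
    def list_files() -> list[str]:
        return ["a.csv", "b.csv", "c.csv"]  # discovered at runtime

    @task
    def process(path: str) -> None:
        print(f"processing {path}")

    process.expand(path=list_files())


mapped_dag()
```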
Are you a data analyst newbie looking to boost your resume to land your first job? If yes, then up your game as a beginner with these 5 projects that you can’t afford to miss.
Learn the what, how, and why for multi-geo replication. In this post, we’ll share the best tools, practices, and patterns for planning geo-replicated Kafka deployments.
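One detail worth knowing before reading: with MirrorMaker 2’s default replication policy, mirrored topics are prefixed with the source cluster alias, so a consumer in the backup region reads the remote copy under a renamed topic. A hedged sketch with confluent-kafka; brokers, aliases, and topic names are hypothetical.

```python
# Consuming a topic mirrored from another region by MirrorMaker 2.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka.eu-west.example.com:9092",  # local region
    "group.id": "orders-dr-reader",
    "auto.offset.reset": "earliest",
})

# "us-east.orders" is the us-east cluster's "orders" topic as mirrored
# into eu-west under the default replication policy.
consumer.subscribe(["us-east.orders"])

msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(msg.topic(), msg.value())
consumer.close()
```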
Introduction Data normalization is the process of building a database according to what is known as a canonical form, where the final product is a relational database with no data redundancy. More specifically, normalization involves organizing data according to attributes assigned as part of a larger data model. The main goals of database normalization are […] The post Understanding the Basics of Database Normalization appeared first on Analytics Vidhya.
Speaker: Ben Epstein, Stealth Founder & CTO | Tony Karrer, Founder & CTO, Aggregage
When tasked with building a fundamentally new product line with deeper insights than previously achievable for a high-value client, Ben Epstein and his team faced a significant challenge: how to harness LLMs to produce consistent, high-accuracy outputs at scale. In this new session, Ben will share how he and his team engineered a system (based on proven software engineering approaches) that employs reproducible test variations (via temperature 0 and fixed seeds), and enables non-LLM evaluation metrics […]
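The reproducibility levers mentioned there can be sketched in a few lines. This is not the session’s actual system, just a hedged illustration using the openai Python client (1.x); the model name and prompt are placeholders, and the API documents seed as best-effort.

```python
# Pinning down LLM output: temperature=0 removes sampling randomness,
# and a fixed seed reduces the remaining nondeterminism (best-effort).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Classify: 'great product!'"}],
    temperature=0,
    seed=42,
)
print(response.choices[0].message.content)
```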