Imagine walking through an art exhibition at the renowned Gagosian Gallery, where paintings seem to be a blend of surrealism and lifelike accuracy. One particular piece catches your eye: it depicts a child staring at the viewer with wind-tossed hair, evoking the feel of the Victorian era through its coloring and what appears to be a simple linen dress.
In the world of data, two crucial roles play a significant part in unlocking the power of information: Data Scientists and Data Engineers. But what sets these wizards of data apart? Welcome to the ultimate showdown of Data Scientist vs Data Engineer! In this captivating journey, we’ll explore the distinctive paths these tech titans take […] The post Data Engineer vs Data Scientist: Which Career to Choose?
Originally published 6 July 2023 👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of six topics in today’s subscriber-only The Scoop issue. If you’re not yet a full subscriber, you missed this week’s deep-dive on What a senior engineer is at Big Tech. To get the full issues twice a week, subscribe here.
Nothing gives me greater joy than rocking the boat. I take pleasure in finding what people love most in tech and trying to poke holes in it. Everything is sacred. Nothing is sacred. I also enjoy doing simple things, things that have a “real-life” feel to them. I suppose I could be like the others […] The post Polars vs Pandas. Inside an AWS Lambda. appeared first on Confessions of a Data Guy.
Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.
Meta recently announced they have made Buck2 open-source. Buck2 is a from-scratch rewrite of Buck, a polyglot, monorepo build system that was developed and used at Meta (Facebook), and shares a few similarities with Bazel. As you may know, the Scalable Builds Group at Tweag has a strong interest in such scalable build systems. We were thrilled to have the opportunity to work with Meta on Buck2 to help make the tool useful and successful in the open-source use case.
Summary All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to.
🧜♂️ (credits) Hey, this is a mid-2023 edition with some of my favourite articles and the popular articles that have been shared this year in the newsletter. There isn't any fancy calculation to find the popular articles. Here's how it's done. Every link sent in each newsletter is tracked in 2 ways: when you click on a link, it first redirects you to my blog, so I know that you've clicked on it; and it adds ref=blef.fr to the URL, so the original article's author knows the visit came from the newsletter.
Introduction In this era of Generative AI, data generation is at its peak. Building an accurate machine learning or AI model requires a high-quality dataset. Quality assurance of the dataset is the most critical task, as poor data causes inaccurate analytics and unreliable predictions that can affect the reputation of any business and […] The post Getting Started with Amazon SageMaker Ground Truth appeared first on Analytics Vidhya.
👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover topics related to Big Tech and startups through the lens of engineering managers and senior engineers. In this article, we cover one out of five topics from today's subscriber-only deep dive on Advice on how to sell a startup. To get full issues twice a week, subscribe here.
You certainly know it: the watermark (aka GC watermark) is responsible for cleaning the state store in Apache Spark Structured Streaming. But you may not know that it's not the only time-based condition. There is another one involved in stream-to-stream joins.
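To make the two conditions concrete, here is a minimal PySpark sketch (stream names, columns, and intervals are made up, not from the article): each side carries its own watermark, and the join's event-time range is the second time-based constraint Spark uses to decide when buffered rows can be evicted.

```python
# A sketch, not the article's code: a stream-to-stream join with two
# time-based cleanup conditions, the watermarks and the join time range.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("watermark-join-sketch").getOrCreate()

impressions = (
    spark.readStream.format("rate").load()
    .select(F.col("timestamp").alias("imp_time"),
            (F.col("value") % 100).alias("imp_ad_id"))
    .withWatermark("imp_time", "10 minutes")
)

clicks = (
    spark.readStream.format("rate").load()
    .select(F.col("timestamp").alias("click_time"),
            (F.col("value") % 100).alias("click_ad_id"))
    .withWatermark("click_time", "20 minutes")
)

# The BETWEEN range is the extra time-based condition: together with the
# watermarks it bounds how long each buffered row can still find a match,
# so the state store can evict expired rows.
joined = impressions.join(
    clicks,
    F.expr("""
        imp_ad_id = click_ad_id AND
        click_time BETWEEN imp_time AND imp_time + INTERVAL 1 HOUR
    """),
)

query = joined.writeStream.format("console").outputMode("append").start()
```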
1. Introduction
2. Sample project
3. Best practices
3.1. Use standard patterns that progressively transform your data
3.2. Ensure data is valid before exposing it to its consumers (aka data quality checks)
3.3. Avoid data duplicates with idempotent pipelines
3.4. Write DRY code & keep I/O separate from data transformation
3.5. Know the when, how, & what (aka metadata) of pipeline runs for easier debugging
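As a flavor of practices 3.2 and 3.3, here is a minimal sketch assuming a Spark pipeline partitioned by run date; the path and column names are hypothetical, not from the post:

```python
# Hypothetical sketch of an idempotent load: re-running the pipeline for
# the same run_date replaces that partition instead of appending duplicates.
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

def publish(df: DataFrame, path: str = "s3://lake/orders/") -> None:
    # Basic quality gate (practice 3.2): refuse to publish an empty batch.
    if df.count() == 0:
        raise ValueError("refusing to publish an empty dataset")
    # Dynamic partition overwrite (practice 3.3) rewrites only the
    # partitions present in df, so retries and backfills converge to the
    # same result as a single successful run.
    (df.write
       .mode("overwrite")
       .option("partitionOverwriteMode", "dynamic")
       .partitionBy("run_date")
       .parquet(path))
```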
Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage
There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.
Summary Real-time data processing has steadily been gaining adoption due to advances in the accessibility of the technologies involved. Despite that, it is still a complex set of capabilities. To bring streaming data within reach of application engineers, Matteo Pelati helped create Dozer. In this episode he explains how investing in high-performance, operationally simplified streaming with a familiar API can yield significant benefits for software and data teams together.
See you on the road (credits) Hey, I hope this newsletter finds you well. This is a small blogpost to give you a few reads while waiting for your next trip. Summer is already in the air, and I found fewer articles for the selection this week. Also, be ready for the Data News: Summer Edition. For the next 5 releases it will be a bit different than usual: less curation and more original articles written in advance, to allow me to take a break.
The ETL & ELT tool market is experiencing continuous transformation, propelled by fluctuating pricing structures and the advent of inventive alternatives. This industry remains fiercely competitive due to these changing elements and a swiftly growing user base. In the following sections, we will explore four emerging alternatives to Fivetran. Of course, that is if you… Read more The post 4 Alternatives to Fivetran: The Evolving Dynamics of the ETL & ELT Tool Market appeared first
👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of six topics in today's subscriber-only The Pulse issue. If you're not yet a full subscriber, you missed this week's deep dive on Software architect archetypes. To get the full issues twice a week, subscribe here. Before we start, a small change.
Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives
Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-driven decisions.
Starting from Apache Spark 3.2.0, it is now possible to load an initial state into arbitrary stateful pipelines. Even though the feature is easy to use, it hides some interesting implementation details!
Sometimes it seems like the Data Engineering landscape is starting to shoot off into infinity. With the rise of Rust and new tools like DuckDB, Polars, and whatever else, things do seem to be shifting at a fundamental level. It seems like there is someone at the base of a teetering rock with a crowbar, picking and […] The post Ballista (Rust) vs Apache Spark.
Summary Data has been one of the most substantial drivers of business and economic value for the past few decades. Bob Muglia has had a front-row seat to many of the major shifts driven by technology over his career. In his recent book "Datapreneurs" he reflects on the people and businesses that he has known and worked with and how they relied on data to deliver valuable services and drive meaningful change.
Have fun training models on this (credits) Hey, it's Saturday. I hope you're enjoying July, taking a well-deserved break, reading data engineering articles at the beach or traveling to unknown places. Sometimes there are Fridays when I can't find any glue between the articles for the newsletter; I have an idea of something to compensate with, but it takes me the whole Friday of exploration.
With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You'll learn how to:
- Understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to
- Write DAGs that adapt to your data at runtime and set up alerts and notifications
- Scale your DAGs
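For a taste of what such DAG code looks like, here is a minimal TaskFlow-style sketch; the task names and schedule are illustrative, not taken from the eBook, and it assumes Airflow 2.4+ for the `schedule` parameter.

```python
# Illustrative TaskFlow-style DAG, not from the eBook.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2023, 7, 1), catchup=False)
def example_pipeline():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]

    @task
    def transform(rows: list[int]) -> list[int]:
        return [r * 2 for r in rows]

    @task
    def load(rows: list[int]) -> None:
        print(f"loaded {len(rows)} rows")

    # TaskFlow infers the dependency chain from these calls.
    load(transform(extract()))

example_pipeline()
```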
Some data teams need to have their data near real-time for dashboards and reporting. So how can they implement a near real-time data pipeline? One possible choice is a method called change data capture, also known as CDC. I have seen companies employ multiple ways to use CDC or CDC-like approaches to pull data from… Read more The post What Is Change Data Capture appeared first on Seattle Data Guy.
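As one concrete illustration, here is a minimal sketch of the simplest CDC-like approach, query-based polling on an updated_at column; the table, columns, and SQLite backend are assumptions for the example, and log-based CDC tools (which read the database's change log) avoid polling entirely.

```python
# Hypothetical query-based CDC: poll for rows changed since the last
# high-water mark. Table and column names are assumptions.
import sqlite3

def pull_changes(conn: sqlite3.Connection, last_seen: str) -> list[tuple]:
    # Fetch only rows modified since the previous pull; the caller stores
    # the max(updated_at) it saw as the next high-water mark.
    cur = conn.execute(
        "SELECT id, payload, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    )
    return cur.fetchall()
```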
Reinforcement learning basics to get your feet wet. Learn the components and key concepts in the reinforcement learning framework: from agents and rewards to value functions, policy, and more.
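For a taste of those components, here is a toy sketch of tabular Q-learning in Python; the two-action setup and constants are illustrative, not from the article.

```python
# A toy sketch of the pieces named above: an epsilon-greedy policy and a
# value-function update (tabular Q-learning).
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1   # learning rate, discount, exploration
ACTIONS = [0, 1]
q = defaultdict(float)                    # (state, action) -> estimated value

def choose_action(state):
    # Policy: mostly exploit the current value estimates, sometimes explore.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])

def update(state, action, reward, next_state):
    # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
```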
That's a conference I heard about only recently. What a huge mistake! Despite the lack of the word "data" in its name, it covers many interesting data topics, and before I share my notes from this year's Data+AI Summit, let me do the same for Berlin Buzzwords!
Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.
Summary For business analytics the way that you model the data in your warehouse has a lasting impact on what types of questions can be answered quickly and easily. The major strategies in use today were created decades ago when the software and hardware for warehouse databases were far more constrained. In this episode Maxime Beauchemin of Airflow and Superset fame shares his vision for the entity-centric data model and how you can incorporate it into your own warehouse design.
Who's leading the data peloton? (credits) Hey you, this is the Saturday Data News edition 🥲 Time flies. I'm writing the August series of articles about "creating data platforms" in advance, and I'm looking for ideas about the data I could use for it. Some kind of simulated real-time data would be best.
Building a datalake for semi-structured data such as JSON has always been challenging. If the JSON documents stream continuously from healthcare vendors, we need a robust, modern architecture that can handle such high volume. At the same time, an analytics layer also needs to be created so as to generate value from it.
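One possible shape of such an architecture, sketched in PySpark under assumed paths and a made-up schema: land the streaming JSON with an explicit schema and checkpointing, so the analytics layer can query the resulting lake table.

```python
# A sketch under assumed paths and schema: streams cannot infer a schema,
# so one is declared explicitly; checkpointing gives fault tolerance.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("patient_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .schema(schema)
       .json("s3://landing/healthcare/"))

query = (raw.writeStream
         .format("parquet")  # or a lakehouse format such as Delta, if available
         .option("path", "s3://lake/healthcare_events/")
         .option("checkpointLocation", "s3://lake/_checkpoints/healthcare_events/")
         .start())
```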
Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?
It's often a dilemma: should multiple sinks working on the same data source live in the same Apache Spark Structured Streaming application or in different ones? Both solutions may be valid depending on your use case, but let's focus here on the former: multiple sinks together.
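A minimal sketch of the "multiple sinks together" option using foreachBatch, which is one common way to do it (sink paths are placeholders, not from the article): each micro-batch is cached once, then written to both destinations.

```python
# Sketch: two sinks fed from one source inside a single application.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = spark.readStream.format("rate").load()

def write_to_sinks(batch_df, batch_id):
    batch_df.persist()                                   # avoid recomputing the batch
    batch_df.write.mode("append").parquet("/tmp/sink_a")
    batch_df.write.mode("append").json("/tmp/sink_b")
    batch_df.unpersist()

query = (events.writeStream
         .foreachBatch(write_to_sinks)
         .option("checkpointLocation", "/tmp/checkpoints/multi_sink")
         .start())
```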
Today, we are excited to announce the public preview of Databricks Assistant, a context-aware AI assistant, available natively in Databricks Notebooks and the SQL editor.
Summary Feature engineering is a crucial aspect of the machine learning workflow. To make that possible, there are a number of technical and procedural capabilities that must be in place first. In this episode Razi Raziuddin shares how data engineering teams can support the machine learning workflow through the development and support of systems that empower data scientists and ML engineers to build and maintain their own features.
2 summits (credits: I cropped the image) Hey, since I said I should try to send the newsletter on a specific schedule, I did not. Haha. Still, here is the newsletter for last week. This is a small wrap-up of the Snowflake and Databricks Data + AI summits, which took place last week. There are so many sessions at both summits that it's impossible to watch everything; what's more, Databricks and Snowflake do not put everything online in free access, so I can't watch everything.
In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!
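As a taste of one of those features, here is a small sketch of dynamic task mapping, where one task instance is created per item returned at runtime; the file names are made up, and it assumes Airflow 2.3+ for expand().

```python
# Illustrative dynamic task mapping: one mapped task instance per file
# discovered at runtime.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2023, 7, 1), catchup=False)
def mapped_pipeline():
    @task
    def list_files() -> list[str]:
        return ["a.csv", "b.csv", "c.csv"]

    @task
    def process(path: str) -> None:
        print(f"processing {path}")

    # expand() creates one task instance per element at runtime.
    process.expand(path=list_files())

mapped_pipeline()
```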