Top Data Engineering Digest Data Workflow Data Content for June, 2024

June, 2024

Data Engineering Projects

Start Data Engineering

JUNE 14, 2024

1. Introduction 2. Run Data Pipelines 2.1. Run on codespaces 2.2. Run locally 3. Projects 3.1. Projects from least to most complex 3.2. Batch pipelines 3.3. Stream pipelines 3.4. Event-driven pipelines 3.5. LLM RAG pipelines 4. Conclusion 1. Introduction Whether you are new to data engineering or have been in the data field for a few years, one of the most challenging parts of learning new frameworks is setting them up!

Data Engineering

Data Engineering Data Engineer Project Engineering

What I’ve Learned After A Decade Of Data Engineering

Confessions of a Data Guy

JUNE 20, 2024

After 10 years of Data Engineering work, I think it’s time to hang up the proverbial hat and ride off into the sunset, never to be seen again. I wish. Everything has changed in 10 years, yet nothing has changed in 10 years, how is that even possible? Sometimes I wonder if I’ve learned anything […] The post What I’ve Learned After A Decade Of Data Engineering appeared first on Confessions of a Data Guy.

Data Engineering

Data Engineering Data Engineer Engineering Data

Join 37,000+

Insiders

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Webinars

Agent Tooling: Connecting AI to Your Tools, Systems & Data

How to Modernize Manufacturing Without Losing Control

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

Trending Sources

Stitching Together Enterprise Analytics With Microsoft Fabric

Data Engineering Podcast

JUNE 23, 2024

Summary Data lakehouse architectures have been gaining significant adoption. To accelerate adoption in the enterprise Microsoft has created the Fabric platform, based on their OneLake architecture. In this episode Dipti Borkar shares her experiences working on the product team at Fabric and explains the various use cases for the Fabric service. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex.

Data Lake

Data Lake High Quality Data Hadoop Machine Learning

Webinars

Agent Tooling: Connecting AI to Your Tools, Systems & Data

How to Modernize Manufacturing Without Losing Control

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

Using SQL with Python: SQLAlchemy and Pandas

KDnuggets

JUNE 12, 2024

A simple tutorial on how to connect to databases, execute SQL queries, and analyze and visualize data.

SQL

SQL Python Database Data

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

Data

Databricks, Snowflake and the future

Christophe Blefari

JUNE 21, 2024

Welcome to the snow world ( credits ) Every year, the competition between Snowflake and Databricks intensifies, using their annual conferences as a platform for demonstrating their power. This year, the Snowflake Summit was held in San Francisco from June 2 to 5, while the Databricks Data+AI Summit took place 5 days later, from June 10 to 13, also in San Francisco.

Metadata

Metadata Data Warehouse BI MySQL

Unpacking the Latest Streaming Announcements: A Comprehensive Analysis

Jesse Anderson

JUNE 12, 2024

This video covers the latest announcements from StreamNative, Confluent, and WarpStream. We discuss communication protocols, how they’re used, and what they mean for you. We also discuss the various systems using Kafka’s protocol. Finally, we discuss the announcements about writing to Iceberg and DeltaLake directly from the broker and what that means for costs and operational ease.

Kafka

Kafka Data Lake Architecture Cloud

Why use Apache Airflow (or any orchestrator)?

Start Data Engineering

JUNE 24, 2024

1. Introduction 2. Features crucial to building and maintaining data pipelines 2.1. Schedulers to run data pipelines at specified frequency 2.2. Orchestrators to define the order of execution of your pipeline tasks 2.2.1. Define the order of execution of pipeline tasks with a DAG 2.2.2. Define where to run your code 2.2.3. Use operators to connect to popular services 2.3.

Data Pipeline

Data Pipeline Coding Building Data

More Trending

Why use Apache Airflow (or any orchestrator)?

Start Data Engineering

JUNE 24, 2024

Data Pipeline

Data Pipeline Coding Building Data

Open Sourcing Unity Catalog

databricks

JUNE 13, 2024

We are excited to announce that we are open sourcing Unity Catalog, the industry’s first open source catalog for data and AI governance.

Government

Government Data

Improve Data Quality Through Engineering Rigor And Business Engagement With Synq

Data Engineering Podcast

JUNE 30, 2024

Summary This episode features an insightful conversation with Petr Janda, the CEO and founder of Synq. Petr shares his journey from being an engineer to founding Synq, emphasizing the importance of treating data systems with the same rigor as engineering systems. He discusses the challenges and solutions in data reliability, including the need for transparency and ownership in data systems.

Pipeline-centric

Pipeline-centric Engineering Data Lake High Quality Data

5 Free Artificial Intelligence Courses from Top Universities

KDnuggets

JUNE 21, 2024

Want to learn AI from the best of resources? Check out these free AI courses from top universities.

OpenAI Acquires Rockset

Rockset

JUNE 21, 2024

I’m excited to share that OpenAI has completed the acquisition of Rockset. We are thrilled to join the OpenAI team and bring our technology and expertise to building safe and beneficial AGI. From the start, our vision at Rockset was to fundamentally transform the way data-driven applications were built. We developed our search and analytics database, taking full advantage of the cloud, to eliminate the complexity inherent in the data infrastructure needed for these apps.

Database

Database Cloud Accessible Accessibility

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

Systems

Introducing Polaris Catalog: An Open Source Catalog for Apache Iceberg

Snowflake

JUNE 3, 2024

Open source file and table formats have garnered much interest in the data industry because of their potential for interoperability — unlocking the ability for many technologies to safely operate over a single copy of data. Greater interoperability not only reduces the complexity and costs associated with using many tools and processing engines in parallel, but it would also reduce potential risks associated with vendor lock-in.

Amazon Web Services

Amazon Web Services Google Cloud Data Architect Government

Robinhood to Acquire Bitstamp

Robinhood

JUNE 6, 2024

This acquisition will bring Bitstamp’s globally-scaled crypto exchange to Robinhood, with retail and institutional customers across the EU, UK, US and Asia. This strategic combination better positions Robinhood to expand outside of the US and will bring a trusted and reputable institutional business to Robinhood. Expected to close in the first half of 2025, subject to customary closing conditions, including regulatory approvals.

Retail

Retail Systems Process Management

Mosaic AI: Build and deploy production-quality Compound AI Systems

databricks

JUNE 12, 2024

Over the last year, we have seen a surge of commercial and open-source foundation models showing strong reasoning abilities on general knowledge tasks.

Systems

Systems Building Data Science Data

Being Data Driven At Stripe With Trino And Iceberg

Data Engineering Podcast

JUNE 16, 2024

Summary Stripe is a company that relies on data to power their products and business. To support that functionality they have invested in Trino and Iceberg for their analytical workloads. In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform.

Data Lake

Data Lake High Quality Data Metadata Machine Learning

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Manufacturing

Deploying Machine Learning Models: A Step-by-Step Tutorial

KDnuggets

JUNE 20, 2024

Image by author Model deployment is the process of trained models being integrated into practical applications. This includes defining the necessary environment, specifying how input data is introduced into the model and the output produced, and the capacity to analyze new data and provide relevant predictions or categorizations.

Machine Learning

Machine Learning Process Data

How Meta trains large language models at scale

Engineering at Meta

JUNE 12, 2024

As we continue to focus our AI research and development on solving increasingly complex problems, one of the most significant and challenging shifts we’ve experienced is the sheer scale of computation required to train large language models (LLMs). Traditionally, our AI model training has involved a training massive number of models that required a comparatively smaller number of GPUs.

Algorithm

Algorithm Data Storage Technology Building

Infoshare 2024 - Retrospective

Waitingforcode

JUNE 24, 2024

Last May I gave a talk about stream processing fallacies at Infoshare in Gdansk. Besides this speaking experience, I was also - and maybe among others - an attendee who enjoyed several talks in software and data engineering areas. I'm writing this blog post to remember them and why not, share the knowledge with you!

Data Engineering

Data Engineering Data Engineer Engineering Process

Is Python OOP the Devil? Or Savior?

Confessions of a Data Guy

JUNE 3, 2024

Nothing will raise the hackles on the backs of hairy and pale programmers who’ve been stuck in their mom’s basement for a decade like bringing up OOP (Object Oriented Programming), especially in the context of Python. It’s like two fattened calves prepared for slaughter, sharpen your knives, and take your place, it’s time to feast […] The post Is Python OOP the Devil?

Python

Python Programming IT Data

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: Understand the building blocks DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to Write DAGs that adapt to your data at runtime and set up alerts and notifications Scale you

Data Engineer

Introducing AI/BI: Intelligent Analytics for Real-World Data

databricks

JUNE 11, 2024

Today, we are excited to announce Databricks AI/BI , a new type of business intelligence product built from the ground up to deeply.

BI Business Intelligence Data

X-Ray Vision For Your Flink Stream Processing With Datorios

Data Engineering Podcast

JUNE 9, 2024

Summary Streaming data processing enables new categories of data products and analytics. Unfortunately, reasoning about stream processing engines is complex and lacks sufficient tooling. To address this shortcoming Datorios created an observability platform for Flink that brings visibility to the internals of this popular stream processing system. In this episode Ronen Korman and Stav Elkayam discuss how the increased understanding provided by purpose built observability improves the usefulness

Process

Process Data Lake High Quality Data Machine Learning

5 Free University Courses to Learn Coding for Data Science

KDnuggets

JUNE 12, 2024

Learn programming for free from top-tier universities like Harvard and MIT.

Data Science

Data Science Coding Programming Data

Maintaining large-scale AI capacity at Meta

Engineering at Meta

JUNE 12, 2024

Meta is currently operating many data centers with GPU training clusters across the world. Our data centers are the backbone of our operations, meticulously designed to support the scaling demands of compute and storage. A year ago, however, as the industry reached a critical inflection point due to the rise of artificial intelligence (AI), we recognized that to lead in the generative AI space we’d need to transform our fleet.

Utilities

Utilities Media Designing Engineering

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Cloud

Delta Lake table as a changelog

Waitingforcode

JUNE 17, 2024

One of the big challenges in streaming Delta Lake is the inability to handle in-place changes, like updates, deletes, or merges. There is good news, though. With a little bit of effort on your data provider's side, you can process a Delta Lake table as you would process Apache Kafka topics, hence without in-place changes.

Kafka

Kafka Process Data

Introducing Snowpark pandas API: Run Distributed pandas at Scale in Snowflake

Snowflake

JUNE 5, 2024

Python’s popularity has grown significantly, quickly becoming the preferred language for development across machine learning, application development, pipelines and more. At Snowflake we are deeply committed to delivering a best-in-class platform for Python developers. In line with this commitment, we’re thrilled to announce the public preview support of Snowpark pandas API, enabling seamless execution of distributed pandas at scale in Snowflake.

Python

Python Programming Language Government SQL

Databricks + Tabular

databricks

JUNE 3, 2024

We are excited to announce that we have agreed to acquire Tabular, Inc, a data management company founded by Ryan Blue, Daniel Weeks.

Data Management

Data Management Management Data

Practical First Steps In Data Governance For Long Term Success

Data Engineering Podcast

JUNE 2, 2024

Summary Modern businesses aspire to be data driven, and technologists enjoy working through the challenge of building data systems to support that goal. Data governance is the binding force between these two parts of the organization. Nicola Askham found her way into data governance by accident, and stayed because of the benefit that she was able to provide by serving as a bridge between the technology and business.

Data Governance

Data Governance Government Data Lake High Quality Data

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Business Intelligence

FastAPI Tutorial: Build APIs with Python in Minutes

KDnuggets

JUNE 13, 2024

Want to build APIs with Python? Learn how to do so using FastAPI with this step-by-step tutorial.

Python

Python Building

AI-Enhanced User Experiences in ArcGIS Pro 3.3

ArcGIS

JUNE 3, 2024

Learn about the new AI-enhanced user experiences for geoprocessing in ArcGIS Pro 3.3, including semantic search and tool suggestions.

Infoshare 2024: Stream processing fallacies, part 2

Waitingforcode

JUNE 5, 2024

The blog shares the last fallacies for my 7 years stream processing journey.

Process

Recognizing Customer-Focused Innovation at Partner Summit 2024: Announcing the Global Snowflake Partners of the Year

Snowflake

JUNE 5, 2024

Each year, we are humbled and honored to look back on the contributions from the Snowflake Partner Network (SPN) and recognize their hard work with the Snowflake Partner Awards. Our partners help drive customer success and build an ever-expanding open ecosystem of solutions built on the AI Data Cloud. In the midst of this year’s AI Data Cloud Summit , we announced the 2024 Snowflake Partner Awards, recognizing 36 partners that are winning together with Snowflake and honoring them for their conti

Entertainment

Entertainment Manufacturing Retail Healthcare

Apache Airflow® Best Practices: DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Data

June, 2024

Data Engineering Projects

What I’ve Learned After A Decade Of Data Engineering

Webinars

Trending Sources

Stitching Together Enterprise Analytics With Microsoft Fabric

Webinars

Using SQL with Python: SQLAlchemy and Pandas

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Databricks, Snowflake and the future

Unpacking the Latest Streaming Announcements: A Comprehensive Analysis

Why use Apache Airflow (or any orchestrator)?

Sign up to get articles personalized to your interests!

More Trending

Why use Apache Airflow (or any orchestrator)?

Open Sourcing Unity Catalog

Improve Data Quality Through Engineering Rigor And Business Engagement With Synq

5 Free Artificial Intelligence Courses from Top Universities

OpenAI Acquires Rockset

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Introducing Polaris Catalog: An Open Source Catalog for Apache Iceberg

Robinhood to Acquire Bitstamp

Mosaic AI: Build and deploy production-quality Compound AI Systems

Being Data Driven At Stripe With Trino And Iceberg

How to Modernize Manufacturing Without Losing Control

Deploying Machine Learning Models: A Step-by-Step Tutorial

How Meta trains large language models at scale

Infoshare 2024 - Retrospective

Is Python OOP the Devil? Or Savior?

The Ultimate Guide to Apache Airflow DAGS

Introducing AI/BI: Intelligent Analytics for Real-World Data

X-Ray Vision For Your Flink Stream Processing With Datorios

5 Free University Courses to Learn Coding for Data Science

Maintaining large-scale AI capacity at Meta

Optimizing The Modern Developer Experience with Coder

Delta Lake table as a changelog

Introducing Snowpark pandas API: Run Distributed pandas at Scale in Snowflake

Databricks + Tabular

Practical First Steps In Data Governance For Long Term Success

15 Modern Use Cases for Enterprise Business Intelligence

FastAPI Tutorial: Build APIs with Python in Minutes

AI-Enhanced User Experiences in ArcGIS Pro 3.3

Infoshare 2024: Stream processing fallacies, part 2

Recognizing Customer-Focused Innovation at Partner Summit 2024: Announcing the Global Snowflake Partners of the Year

Apache Airflow® Best Practices: DAG Writing

Stay Connected