Top Data Engineering Digest Data Integration Data Collection Content for Week of Jun 15

Sat.Jun 15, 2024 - Fri.Jun 21, 2024

What I’ve Learned After A Decade Of Data Engineering

Confessions of a Data Guy

JUNE 20, 2024

After 10 years of Data Engineering work, I think it’s time to hang up the proverbial hat and ride off into the sunset, never to be seen again. I wish. Everything has changed in 10 years, yet nothing has changed in 10 years, how is that even possible? Sometimes I wonder if I’ve learned anything […]

Data Engineering

Data Engineering Data Engineer Engineering Data

5 Free Artificial Intelligence Courses from Top Universities

KDnuggets

JUNE 21, 2024

Want to learn AI from the best resources? Check out these free AI courses from top universities.

Webinars

Agent Tooling: Connecting AI to Your Tools, Systems & Data

How to Modernize Manufacturing Without Losing Control

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

Trending Sources

Databricks, Snowflake and the future

Christophe Blefari

JUNE 21, 2024

Every year, the competition between Snowflake and Databricks intensifies, using their annual conferences as a platform for demonstrating their power. This year, the Snowflake Summit was held in San Francisco from June 2 to 5, while the Databricks Data+AI Summit took place 5 days later, from June 10 to 13, also in San Francisco.

Metadata

Metadata Data Warehouse BI MySQL

Being Data Driven At Stripe With Trino And Iceberg

Data Engineering Podcast

JUNE 16, 2024

Stripe is a company that relies on data to power their products and business. To support that functionality they have invested in Trino and Iceberg for their analytical workloads. In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform.

Data Lake

Data Lake High Quality Data Metadata Machine Learning

A Guide to Debugging Apache Airflow® DAGs

In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples for debugging Airflow DAGs. You’ll learn how to: Create a standardized process for debugging to quickly diagnose errors in your DAGs Identify common issues with DAGs, tasks, and connections Distinguish between Airflow-relate

Data Pipeline

OpenAI Acquires Rockset

Rockset

JUNE 21, 2024

I’m excited to share that OpenAI has completed the acquisition of Rockset. We are thrilled to join the OpenAI team and bring our technology and expertise to building safe and beneficial AGI. From the start, our vision at Rockset was to fundamentally transform the way data-driven applications were built. We developed our search and analytics database, taking full advantage of the cloud, to eliminate the complexity inherent in the data infrastructure needed for these apps.

Database

Database Cloud Accessibility Accessible

Deploying Machine Learning Models: A Step-by-Step Tutorial

KDnuggets

JUNE 20, 2024

Model deployment is the process of integrating trained models into practical applications. This includes defining the necessary environment, specifying how input data is introduced into the model and the output produced, and the capacity to analyze new data and provide relevant predictions or categorizations.

Machine Learning

Machine Learning Process Data

Delta Lake table as a changelog

Waitingforcode

JUNE 17, 2024

One of the big challenges in streaming Delta Lake is the inability to handle in-place changes, like updates, deletes, or merges. There is good news, though. With a little bit of effort on your data provider's side, you can process a Delta Lake table as you would process Apache Kafka topics, hence without in-place changes.
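The append-only idea can be sketched in plain Python. This is only an illustration of the general changelog pattern, not the approach from the linked post and not Delta Lake API code: the `ChangeEvent` type, the `replay` helper, and the `upsert`/`delete` operation names are all hypothetical. The assumption is that the data provider appends one event per change instead of rewriting rows, so a consumer can read the table sequentially, much like a Kafka topic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One appended row in a hypothetical changelog table."""
    key: str
    op: str                 # assumed operation names: "upsert" or "delete"
    value: Optional[dict]   # payload for upserts, None for deletes
    version: int            # monotonically increasing position in the log


def replay(events):
    """Fold an append-only changelog into the current state per key."""
    state = {}
    for event in sorted(events, key=lambda e: e.version):
        if event.op == "delete":
            state.pop(event.key, None)
        elif event.op == "upsert":
            state[event.key] = event.value
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return state


# The provider never updates or deletes rows in place; it only appends events.
changelog = [
    ChangeEvent("user-1", "upsert", {"plan": "free"}, 1),
    ChangeEvent("user-2", "upsert", {"plan": "pro"}, 2),
    ChangeEvent("user-1", "upsert", {"plan": "pro"}, 3),
    ChangeEvent("user-2", "delete", None, 4),
]

current = replay(changelog)
print(current)
```

Because every change is a new row, a streaming reader only ever sees appends, and the latest state can be rebuilt at any time by replaying the log in order.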

Kafka

Kafka Process Data

More Trending

Open, Interoperable Storage with Iceberg Tables, Now Generally Available

Snowflake

JUNE 21, 2024

Thousands of customers have worked with Snowflake to cost-effectively build a secure data foundation as they look to solve a growing variety of business problems with more data. Increasingly customers are looking to expand that powerful foundation to a broader set of data across their enterprise. Snowflake is now making it even easier for customers to bring the platform’s usability, performance, governance and many workloads to more data with Iceberg tables (now generally available), unlocking f

Data Lake

Data Lake BI Business Intelligence Metadata

Databricks Named a Leader in 2024 Gartner® Magic Quadrant™ for Data Science and Machine Learning Platforms

databricks

JUNE 20, 2024

We are excited to announce that Gartner has recognized Databricks as a Leader in the 2024 Gartner® Magic Quadrant™ for Data Science and.

Data Science

Data Science Machine Learning Data

Creating AI-Driven Solutions: Understanding Large Language Models

KDnuggets

JUNE 20, 2024

Understanding LLMs is pivotal in unlocking the full potential of AI-driven solutions across various domains. As we navigate the process of building AI-driven solutions, it is essential to approach the development and deployment of LLMs with a focus on responsible AI practices.

Building

Building Process IT

Boost your Productivity with Tool Parameter Overrides in ArcGIS Pro 3.3

ArcGIS

JUNE 17, 2024

Productivity Update! Learn how to override default parameter values for geoprocessing tools in ArcGIS Pro 3.3. Override Geoprocessing Tool Defaults in ArcGIS Pro 3.

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

Data

The Future of Telecoms: Embracing Gen AI as a Strategic Competitive Advantage

Snowflake

JUNE 20, 2024

The telecom industry is undergoing an unprecedented transformation. Fueled by tech advancements such as 5G, cloud computing, Internet of Things (IoT) and machine learning (ML), telecoms have the opportunity to reshape and streamline operations and make significant improvements in service delivery, customer experience and network optimization. Key to these technologies is generative AI (gen AI), a dynamic form of artificial intelligence that leverages vast amounts of data to analyze and produce r

Cloud Computing

Cloud Computing Machine Learning Technology Cloud

Cloudera Unveils Plans for Annual Pride Celebration in Cork

Cloudera

JUNE 17, 2024

Pride Month is underway and we at Cloudera are looking forward to joining the global celebration of diversity, equity and the ongoing effort for LGBTQ+ (Lesbian, Gay, Bisexual, Transgender, Queer/Questioning) rights and recognition. Pride Month serves as a reminder that the fight for equality and equity for members of the LGBTQ+ community is not over.

Systems

Systems Building IT

A Simple to Implement End-to-End Project with HuggingFace

KDnuggets

JUNE 21, 2024

Generating a ready-to-use HuggingFace model with FastAPI and Docker

Project

Santalucía Seguros: Enterprise-level RAG for Enhanced Customer Service and Agent Productivity

databricks

JUNE 21, 2024

In the insurance sector, customers demand personalized, fast, and efficient service that addresses their needs. Meanwhile, insurance agents must access a large amount.

Insurance

Insurance Accessibility Accessible

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

Systems

Modern Data Engineering: Free Spark to Snowpark Migration Accelerator for Faster, Cheaper Pipelines in Snowflake

Snowflake

JUNE 20, 2024

In the age of AI, enterprises are increasingly looking to extract value from their data at scale but often find it difficult to establish a scalable data engineering foundation that can process the large amounts of data required to build or improve models. Designed for processing large data sets, Spark has been a popular solution, yet it is one that can be challenging to manage, especially for users who are new to big data processing or distributed systems.

Data Engineer

Data Engineer Data Engineering Scala Engineering

How to Turn a REST API Into a Data Stream with Kafka and Flink

Confluent

JUNE 17, 2024

Improve REST API response data w/Kafka and Flink SQL in Confluent Cloud; Automatic connector retriability combats REST flakiness; Demo w/OpenSky data.

Kafka

Kafka SQL Cloud Data

Beginner’s Guide to Machine Learning Testing With DeepChecks

KDnuggets

JUNE 19, 2024

Perform data integrity tests and generate model evaluation reports by writing a few lines of code.

Machine Learning

Machine Learning Data Integration Coding Data

Data Engineering Weekly #176

Data Engineering Weekly

JUNE 16, 2024

Databricks: Open Sourcing Unity Catalog. This week brought many exciting developments, with Snowflake and Databricks announcing open-source catalogs.

Data Engineer

Data Engineer Data Engineering Engineering Kafka

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Manufacturing

Streamlit in Snowflake: Improved Customization, Performance and AI Capabilities

Snowflake

JUNE 17, 2024

Snowflake’s mission is to mobilize the entire world’s data, and there are millions of data scientists and developers who don’t have access to full-stack engineering teams. It’s been our endeavor to bring the power of the AI Data Cloud to every individual developer, data scientist and machine learning engineer, so that they can build and share world-class data apps — all by themselves.

Python

Python Media AWS Machine Learning

PVF: A novel metric for understanding AI systems’ vulnerability against SDCs in model parameters

Engineering at Meta

JUNE 19, 2024

We’re introducing parameter vulnerability factor (PVF), a novel metric for understanding and measuring AI systems’ vulnerability against silent data corruptions (SDCs) in model parameters. PVF can be tailored to different AI models and tasks, adapted to different hardware faults, and even extended to the training phase of AI models. We’re sharing results of our own case studies using PVF to measure the impact of SDCs in model parameters, as well as potential methods of identifying SDCs in model

Systems

Systems Deep Learning Manufacturing Architecture

5 Free Templates for Data Science Projects on Jupyter Notebook

KDnuggets

JUNE 17, 2024

Boost your data science project with these templates.

Data Science

Data Science Project Data Programming

The Importance of Recognizing Juneteenth

Cloudera

JUNE 21, 2024

Juneteenth holds profound significance in the history of freedom and equality for Black Americans. Also known as Freedom Day or Emancipation Day, Juneteenth commemorates the anniversary of June 19, 1865, when news of the Emancipation Proclamation reached Galveston, Texas, finally declaring freedom for enslaved Americans held in the Confederacy–more than two years after the proclamation was issued on January 1, 1863.

Education

Education Building IT

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: Understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to Write DAGs that adapt to your data at runtime and set up alerts and notifications Scale you

Data Engineering

Snowflake Startup Spotlight: API Generation with DreamFactory

Snowflake

JUNE 18, 2024

Welcome to Snowflake’s Startup Spotlight, where we ask startups about the problems they’re solving, the apps they’re building and the lessons they’ve learned during their startup journey. In this edition, we’ll learn why Terence Bennett, CEO of DreamFactory, and his team are championing a new way to think about API integrations. What was the genesis of DreamFactory?

Datasets

Datasets Programming Accessibility Accessible

A Recap of the Data Engineering Open Forum at Netflix

Netflix Tech

JUNE 20, 2024

A summary of sessions at the first Data Engineering Open Forum at Netflix on April 18th, 2024. At Netflix, we aspire to entertain the world, and our data engineering teams play a crucial role in this mission by enabling data-driven decision-making at scale. Netflix is not the only place where data engineers are solving challenging problems with creative solutions.

Data Engineer

Data Engineer Data Engineering Engineering Data Warehouse

A Tour of Python NLP Libraries

KDnuggets

JUNE 17, 2024

Exploring the available text Python packages for your data workflow.

Python

Python Data Workflow Data

What’s new for CAD and BIM in ArcGIS Pro 3.3

ArcGIS

JUNE 17, 2024

Discover what's new in ArcGIS Pro 3.3 for CAD and BIM workflows, allowing you to directly read datasets from Autodesk Revit, Civil 3D, and Industry Foundation Classes.

Datasets

Apache Airflow® Best Practices: DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Data

It’s Not Just About AI: Does Your Data Strategy Match Your Ambition?

Snowflake

JUNE 20, 2024

Recent Snowflake workshops and roundtables have started with the question: “Does your data strategy match your AI ambition?” It certainly sparks customer engagement, but is that the right question to ask? Right now, it seems appropriate with all of the interest — dare I say “hype” — around AI. But it merely reflects the current darling of the tech world, focusing on the technology itself, rather than the ultimate goal.

Food

Food Manufacturing Data Technology

PySpark Explained: The explode and collect_list Functions

Towards Data Science

JUNE 17, 2024

Two useful functions to nest and un-nest data sets in PySpark.
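As a rough picture of what "un-nest" and "nest" mean here, the sketch below mimics the two operations on plain Python lists of dictionaries. It is not PySpark code and is not taken from the linked article: the helpers `explode_rows` and `collect_list_rows` are hypothetical stand-ins that only approximate the behaviour of PySpark's `explode` and `collect_list`, and the sample data is made up.

```python
from collections import defaultdict


def explode_rows(rows, column):
    """Un-nest: emit one output row per element of the list in `column`.

    Rows whose list is empty produce no output rows.
    """
    out = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)      # copy so the input row is not mutated
            new_row[column] = item
            out.append(new_row)
    return out


def collect_list_rows(rows, group_by, column):
    """Nest: gather the values of `column` into one list per `group_by` key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_by]].append(row[column])
    return [{group_by: key, column: values} for key, values in grouped.items()]


nested = [
    {"customer": "a", "items": ["pen", "ink"]},
    {"customer": "b", "items": ["pad"]},
]

flat = explode_rows(nested, "items")
renested = collect_list_rows(flat, "customer", "items")

print(flat)
print(renested)
```

In this single-machine sketch the round trip returns the original nesting. On a real Spark cluster the element order produced by `collect_list` is not guaranteed, so a round trip there may reorder the lists.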

Data Science

Data Science Data SQL Data Engineer

How to Implement Agentic RAG Using LangChain: Part 1

KDnuggets

JUNE 19, 2024

Learn about enhancing LLMs with real-time information retrieval and intelligent agents.

Empowering Enterprise Generative AI with Flexibility: Navigating the Model Landscape

Cloudera

JUNE 18, 2024

The world of Generative AI (GenAI) is rapidly evolving, with a wide array of models available for businesses to leverage. These models can be broadly categorized into two types: closed-source (proprietary) and open-source models. Closed-source models, such as OpenAI’s GPT-4o, Anthropic’s Claude 3, or Google’s Gemini 1.5 Pro, are developed and maintained by private and public companies.

Google Cloud

Google Cloud Government Cloud Architecture

How to Achieve High-Accuracy Results When Using LLMs

Speaker: Ben Epstein, Stealth Founder & CTO | Tony Karrer, Founder & CTO, Aggregage

When tasked with building a fundamentally new product line with deeper insights than previously achievable for a high-value client, Ben Epstein and his team faced a significant challenge: how to harness LLMs to produce consistent, high-accuracy outputs at scale. In this new session, Ben will share how he and his team engineered a system (based on proven software engineering approaches) that employs reproducible test variations (via temperature 0 and fixed seeds), and enables non-LLM evaluation m

Software Engineer
