2025

article thumbnail

Are LLMs making StackOverflow irrelevant?

The Pragmatic Engineer

Hi, this is Gergely with a bonus issue of the Pragmatic Engineer Newsletter. In every issue, I cover topics related to Big Tech and startups through the lens of engineering managers and senior engineers. This article is one out of five sections from The Pulse #119. Full subscribers received this issue a week and a half ago. To get articles like this in your inbox, subscribe here.

article thumbnail

How to turn a 1000-line messy SQL into a modular, & easy-to-maintain data pipeline?

Start Data Engineering

1. Introduction 2. Split your SQL into smaller parts 2.1. Start with a baseline validation to ensure that your changes do not change the output too much 2.2. Split your CTAs/Subquery into separate functions (or models if using dbt) 2.3. Unit test your functions for maintainability and evolution of logic 3. Conclusion 4. Required reading 1. Introduction If you’ve been in the data space long enough, you would have come across really long SQL scripts that someone had written years ago.

SQL 147
Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

article thumbnail

10 GitHub Repositories to Master Math

KDnuggets

Learn math through roadmaps, courses, tutorials, Python frameworks for solving equations, guides, exercises, textbooks, and more.

Python 154
article thumbnail

Data News — Week 25.02

Christophe Blefari

HNY 2025 ( credits ) Happy new year ✨ I wish you the best for 2025. There are multiple ways to start a new year, either with new projects, new ideas, new resolutions or by just keeping doing the same music. I hope you will enjoy 2025. The Data News are here to stay, the format might vary during the year, but here we are for another year. Thank you so much for your support through the years.

Data 130
article thumbnail

Apache Airflow® 101 Essential Tips for Beginners

Apache Airflow® is the open-source standard to manage workflows as code. It is a versatile tool used in companies across the world from agile startups to tech giants to flagship enterprises across all industries. Due to its widespread adoption, Airflow knowledge is paramount to success in the field of data engineering.

article thumbnail

Introducing SAP Databricks

databricks

Today we are announcing a deep partnership with SAP which we think can be game changing for our industry. In short, it is.

IT 131

More Trending

article thumbnail

Overwriting partitioned tables in Apache Spark SQL

Waitingforcode

After publishing a release of my blog post about the insertInto trap, I got an intriguing question in the comments. The alternative to the insertInto, the saveAsTable method, doesn't work well on partitioned data in overwrite mode while the insertInto does. True, but is there an alternative to it that doesn't require using this position-based function?

SQL 130
article thumbnail

Data Integrity for AI: What’s Old is New Again

Precisely

Artificial Intelligence (AI) is all the rage, and rightly so. By now most of us have experienced how Gen AI and the LLMs (large language models) that fuel it are primed to transform the way we create, research, collaborate, engage, and much more. Yet along with the AI hype and excitement comes very appropriate sanity-checks asking whether AI is ready for prime-time.

article thumbnail

Why Pivot Tables Never Die

Simon Späti

While everyone’s talking about AI revolutionizing business, there’s a quiet renaissance happening with one of the most influential business tools created: the pivot table. In 2025, we’re witnessing something remarkable - modern data tools are bringing pivot tables back to the forefront. But why would cutting-edge platforms invest in a decades-old spreadsheet feature?

Coding 130
article thumbnail

Bridging the Data Divide: How Confluent and Databricks Are Unlocking Real-Time AI

Confluent

An expanded partnership between Confluent and Databricks dramatically simplifies the integration between analytical and operational systems.

Systems 118
article thumbnail

Apache Airflow® Best Practices: DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

article thumbnail

Unapologetically Technical Episode 17 – Semih Salihoglu

Jesse Anderson

In this episode of Unapologetically Technical, I interview Semih Salihoglu, Associate Professor at the University of Waterloo and co-founder and CEO of Kuzu. Semih is a researcher and entrepreneur with a background in distributed systems and databases. He shares his journey from a small city in Turkey to the hallowed halls of Yale University, where he studied computer science and economics.

article thumbnail

Testing and Development for Databricks Environment and Code.

Confessions of a Data Guy

Every once in a great while, the question comes up: “How do I test my Databricks codebase?” It’s a fair question, and if you’re new to testing your code, it can seem a little overwhelming on the surface. However, I assure you the opposite is the case. Testing your Databricks codebase is no different than […] The post Testing and Development for Databricks Environment and Code. appeared first on Confessions of a Data Guy.

Coding 114
article thumbnail

Should Python Data Pipelines be Function based or Object-Oriented (OOP)?

Start Data Engineering

1. Introduction 2. Data transformations as functions lead to maintainable code 3. Objects help track things (aka state) 3.1. Track connections & configs when connecting to external systems 3.2. Track pipeline progress (logging, Observer) with objects 3.3. Use objects to store configurations of data systems (e.g., Spark, etc.) 4. Class lets you define reusable code and pipeline patterns 4.1.

article thumbnail

What Are Large Language Models? A Beginner’s Guide for 2025

KDnuggets

Curious about what LLMs are and want to know about them? Explore the Full Guide Right Here, Right Now!

146
146
article thumbnail

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

article thumbnail

What Is PDFMiner And Should You Use It – How To Extract Data From PDFs

Seattle Data Guy

PDF files are one of the most popular file formats today. Because they can preserve the visual layout of documents and are compatible with a wide range of devices and operating systems, PDFs are used for everything from business forms and educational material to creative designs. However, PDF files also present multiple challenges when it… Read more The post What Is PDFMiner And Should You Use It – How To Extract Data From PDFs appeared first on Seattle Data Guy.

IT 130
article thumbnail

DeepSeek R1 on Databricks

databricks

Deepseek-R1 is a state-of-the-art open model that, for the first time, introduces the reasoning capability to the open source community. In particular, the.

133
133
article thumbnail

Strobelight: A profiling service built on open source technology

Engineering at Meta

Were sharing details about Strobelight, Metas profiling orchestrator. Strobelight combines several technologies, many open source, into a single service that helps engineers at Meta improve efficiency and utilization across our fleet. Using Strobelight, weve seen significant efficiency wins, including one that has resulted in an estimated 15,000 servers worth of annual capacity savings.

article thumbnail

The insertInto trap in Apache Spark SQL

Waitingforcode

Even though Apache Spark SQL provides an API for structured data, the framework sometimes behaves unexpectedly. It's the case of an insertInto operation that can even lead to some data quality issues. Why? Let's try to understand in this short article.

SQL 130
article thumbnail

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

article thumbnail

Predictions 2025: AI As Cybersecurity Tool and Target

Snowflake

Though AI is (still) the hottest technology topic, its not the overriding issue for enterprise security in 2025. Advanced AI will open up new attack vectors and also deliver new tools for protecting an organizations data. But the underlying challenge is the sheer quantity of data that overworked cybersecurity teams face as they try to answer basic questions such as, Are we under attack?

Data Lake 100
article thumbnail

The Data Engineering Toolkit: Essential Tools for Your Machine

Simon Späti

To be proficient as a data engineer, you need to know various toolkitsfrom fundamental Linux commands to different virtual environments and optimizing efficiency as a data engineer. This article focuses on the building blocks of data engineering work, such as operating systems, development environments, and essential tools. We’ll start from the ground upexploring crucial Linux commands, containerization with Docker, and the development environments that make modern data engineering possibl

article thumbnail

Color Schemes for the Global Wind Atlas

ArcGIS

Mix colors to build a theme for the new multidimensional Global Wind Atlas that's now available in ArcGIS Living Atlas.

Building 115
article thumbnail

Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data

Towards Data Science

Building more efficient AI TLDR : Data-centric AI can create more efficient and accurate models. I experimented with data pruning on MNIST to classify handwritten digits. Best runs for furthest-from-centroid selection compared to full dataset. Image byauthor. What if I told you that using just 50% of your training data could achieve better results than using the fulldataset?

article thumbnail

Prepare Now: 2025s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

article thumbnail

Building a Fast, Light, and CHEAP Lake House with DuckDB, Delta Lake, and AWS Lambda

Confessions of a Data Guy

Building fun things is a real part of Data Engineering. Using your creative side when building a Lake House is possible, and using tools that are outside the normal box can sometimes be preferable. Checkout this video where I dive into how I build just such a Lake House using Modern Data Stack tools like […] The post Building a Fast, Light, and CHEAP Lake House with DuckDB, Delta Lake, and AWS Lambda appeared first on Confessions of a Data Guy.

AWS 130
article thumbnail

How to ensure consistent metrics in your warehouse

Start Data Engineering

1. Introduction 2. Centralize Metric Definitions in Code Option A: Semantic Layer for On-the-Fly Queries Option B: Pre-Aggregated Tables for Consumers 3. Conclusion & Recap 4. Required Reading 1. Introduction If youve worked on a data team, youve likely encountered situations where multiple teams define metrics in slightly different ways, leaving you to untangle why discrepancies exist.

Utilities 147
article thumbnail

5 AI Agent Frameworks Compared

KDnuggets

Check out this comparison of 5 AI frameworks to determine which you should choose.

139
139
article thumbnail

The Alarming Cost of Poor Data Quality

Monte Carlo

When data engineers tell scary stories around a campfire, its usually a cautionary tale about the cost of poor data quality. Data downtime can occur suddenly at any timeand often not when or where youre looking for it. And its cost is the scariest part of all. But just how much can data downtime actually cost your business? In this article, well learn from a real-life data downtime horror story to understand the cost of bad data, its impacts, and how to prevent it.

Data 98
article thumbnail

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

article thumbnail

Serverless Compute for Notebooks, Workflows and Pipelines is now Generally Available on Google Cloud

databricks

In the rapidly evolving landscape of data engineering and analytics, speed, scalability, and simplicity are invaluable. Serverless compute addresses these needs by eliminating.

article thumbnail

ILA Evo: Meta’s journey to reimagine fiber optic in-line amplifier sites

Engineering at Meta

Today’s rapidly evolving landscape of use cases that demand highly performant and efficient network infrastructure is placing new emphasis on how in-line amplifiers (ILAs) are designed and deployed. Metas ILA Evo effort seeks to reimagine how an ILA site could be deployed to improve speed and cost while making a step function improvement in power efficiency.

article thumbnail

Delta Lake and restore - traveling in time differently

Waitingforcode

Time travel is a quite popular Delta Lake feature. But do you know it's not the single one you can use to interact with the past versions? An alternative is the RESTORE command, and it'll be the topic of this blog post.

IT 130