A few months ago I wrote a blog post about event skew and how dangerous it is for a stateful streaming job. Since it was a high-level explanation, I didn't cover Apache Spark Structured Streaming in depth at the time. Now the watermark topic is back on my learning backlog, so it's a good opportunity to return to event skew and see the dangers it brings for Structured Streaming stateful jobs.
PDF files are one of the most popular file formats today. Because they can preserve the visual layout of documents and are compatible with a wide range of devices and operating systems, PDFs are used for everything from business forms and educational material to creative designs. However, PDF files also present multiple challenges when it… The post What Is PDFMiner And Should You Use It – How To Extract Data From PDFs appeared first on Seattle Data Guy.
When data engineers tell scary stories around a campfire, it's usually a cautionary tale about the cost of poor data quality. Data downtime can occur suddenly at any time, and often not when or where you're looking for it. And its cost is the scariest part of all. But just how much can data downtime actually cost your business? In this article, we'll learn from a real-life data downtime horror story to understand the cost of bad data, its impacts, and how to prevent it.
With over 30 million monthly downloads, Apache Airflow is the tool of choice for programmatically authoring, scheduling, and monitoring data pipelines. Airflow enables you to define workflows as Python code, allowing for dynamic and scalable pipelines suitable to any use case from ETL/ELT to running ML/AI operations in production. This introductory tutorial provides a crash course for writing and deploying your first Airflow pipeline.
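As a taste of what that first pipeline involves, here is a minimal sketch of the extract/transform/load pattern an introductory Airflow DAG typically wires together. It is written as plain Python so it runs anywhere; in Airflow each function would be wrapped with the `@task` decorator inside a `@dag`-decorated function. The function names and sample data are illustrative assumptions, not taken from the tutorial.

```python
# Plain-Python sketch of the extract -> transform -> load steps a first
# Airflow DAG typically chains; in Airflow, each function would be an @task.

def extract() -> list[dict]:
    # Stand-in for pulling rows from an API or source database.
    return [{"city": "Paris", "temp_c": 20}, {"city": "Oslo", "temp_c": 10}]

def transform(rows: list[dict]) -> list[dict]:
    # Add a Fahrenheit column: the kind of light reshaping ETL tasks do.
    return [{**r, "temp_f": r["temp_c"] * 9 / 5 + 32} for r in rows]

def load(rows: list[dict]) -> int:
    # Stand-in for writing to a warehouse table; returns the row count.
    return len(rows)

if __name__ == "__main__":
    loaded = load(transform(extract()))
    print(f"loaded {loaded} rows")
```

In a real DAG, Airflow handles the ordering, retries, and scheduling of these steps; the Python functions themselves stay this simple.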
Do types actually make developers more productive? Or is it just more typing on the keyboard? To answer that question, we're revisiting Diff Authoring Time (DAT): how Meta measures how long it takes to submit changes to a codebase. DAT is just one of the ways we measure developer productivity, and this latest episode of the Meta Tech Podcast takes a look at two concrete use cases for DAT, including a type-safe mocking framework in Hack.
A Deep Dive into Databricks Labs’ DQX: The Data Quality Game Changer for PySpark DataFrames. Recently, a LinkedIn announcement caught my eye, and honestly, it had me on the edge of my seat. Databricks Labs has unveiled DQX, a Python-based Data Quality framework explicitly designed for PySpark DataFrames. Finally, a Dedicated Data Quality Tool for PySpark […] The post PySpark Data Quality on Databricks with DQX appeared first on Confessions of a Data Guy.
Over the last three geospatial-centric blog posts, we've covered the basics of what geospatial data is, how it works in the broader world of data, and how it specifically works in Snowflake based on our native support for GEOGRAPHY, GEOMETRY and H3. Those articles are great for dipping your toe in, getting a feel for the water and maybe even wading into the shallow end of the pool.
Rust is a systems programming language that offers high performance and safety. Python programmers will find Rust's syntax familiar but with more control over memory and performance.
In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!
Electronic products are evolving at lightning speed, driven by an insatiable demand for new consumer devices, energy, transport, robotics, connectivity, data and beyond.
The Critical Role of AI Data Engineers in a Data-Driven World How does a chatbot seamlessly interpret your questions? How does a self-driving car understand a chaotic street scene? The answer lies in unstructured data processing—a field that powers modern artificial intelligence (AI) systems. Unlike neatly organized rows and columns in spreadsheets, unstructured data—such as text, images, videos, and audio—requires advanced processing techniques to derive meaningful insights.
Y Combinator founder Paul Graham advises startup founders to live in the future, then build what's missing. I had the privilege of glimpsing the future through a series of interviews with investors on the bleeding edge of the AI landscape. Insights from these candid conversations laid the foundation for Startup 2025: Building a Business in the Age of AI, the AI startup report that Snowflake is publishing today.
Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.
With the ever-growing focus on GenAI, many legacy BI tools have failed to invest in the analyst. By focusing solely on AI experiences for business teams, they've alienated data teams, relegating analysts to disjointed tools and data silos. In reality, businesses still need people who can help decision-makers assess messy data to diagnose and evaluate business problems.
Large language models (LLMs) are at the heart of generative AI transformations, driving solutions across industries from efficient customer support to simplified data analysis. Enterprises need performant, cost-effective and low-latency inference to scale their gen AI solutions. Yet, the complexity and computational demands of LLM inference present a challenge.
Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?
PodPrep AI, an AI-powered research assistant, leverages EDA and real-time streaming data using Confluent and Flink, in order to help its author with podcast preparation.
Questions that guide architectural decisions to balance functional requirements with non-functional ones, like latency and scalability. Continue reading on Towards Data Science.
Angel Vargas | Software Engineer, API Platform; Swati Kumar | Software Engineer, API Platform; Chris Bunting | Engineering Manager, API Platform
Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.
AI is proving that it's here to stay. While 2023 brought wonder, and 2024 saw widespread experimentation, 2025 will be the year that manufacturing enterprises get serious about AI's applications. But it's complicated: AI proofs of concept are graduating from the sandbox to production, just as some of AI's biggest cheerleaders are turning a bit dour. How to navigate such a landscape is top of mind for me and top executives such as Snowflake's CEO, Sridhar Ramaswamy; Snowflake's Distinguished AI Engine
Read Time: 2 minutes, 33 seconds. Snowflake's PARSE_DOCUMENT function revolutionizes how unstructured data, such as PDF files, is processed within the Snowflake ecosystem. Traditionally, this function is used within SQL to extract structured content from documents. However, I've taken this a step further, leveraging Snowpark to extend its capabilities and build a complete data extraction process.
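PARSE_DOCUMENT is invoked through SQL, so a Snowpark-based extraction process ultimately executes a statement like the one built below. This is a minimal sketch that only constructs the query string, since no live Snowflake connection is available here; the stage name and file path are hypothetical, and in Snowpark you would run the result with `session.sql(query).collect()`.

```python
# Builds the SQL that invokes Snowflake's PARSE_DOCUMENT on a staged file.
# Stage and path below are illustrative; no connection is made in this sketch.

def parse_document_query(stage: str, relative_path: str, mode: str = "LAYOUT") -> str:
    # PARSE_DOCUMENT reads a document from a stage and returns extracted
    # content; the options object selects the extraction mode.
    return (
        "SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT("
        f"@{stage}, '{relative_path}', {{'mode': '{mode}'}}"
        ") AS parsed"
    )

if __name__ == "__main__":
    # Hypothetical stage and PDF path, for illustration only.
    print(parse_document_query("docs_stage", "reports/q3.pdf"))
```

Separating query construction from execution like this also makes the Snowpark pipeline easier to unit-test without touching a warehouse.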
Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali
As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.
Twenty years ago, data was little more than fuel for forecasting. A few marketing insights here. A couple of financial reports there. Today, data doesn't simply support your products; more often than not, it is the product. In the age of AI, data isn't just another cost center, it's a value creator. Data teams aren't service providers; they're essential technology partners.
DareData will close 2024 with 5% revenue growth compared to 2023. At first glance, given the rapid growth in our market, one might be tempted to classify this year as underwhelming. However, 2024 has been a transformative year for us. We started the year as a 100% consulting business. Consulting is highly dependent on people, and in small boutique firms like ours, this often means being heavily reliant on the partners.
Cloud Development Environments (CDEs) are changing how software teams work by moving development to the cloud. Our Cloud Development Environment Adoption Report gathers insights from 223 developers and business leaders, uncovering key trends in CDE adoption. With 66% of large organizations already using CDEs, these platforms are quickly becoming essential to modern development practices.