February, 2024


Using Trino And Iceberg As The Foundation Of Your Data Lakehouse

Data Engineering Podcast

Summary A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (a user-friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offers the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.


Happy Leap Day!

The Pragmatic Engineer

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover topics related to Big Tech and startups through the lens of engineering managers and senior engineers. In this article, we cover one out of three topics from today’s subscriber-only The Pulse issue. Subscribe to get issues like this in your inbox, every week.


Kafka to MongoDB: Building a Streamlined Data Pipeline

Analytics Vidhya

Introduction Data is the fuel of the IT industry and of data science projects in today’s online world. IT industries rely heavily on real-time insights derived from streaming data sources, and handling and processing that streaming data is one of the hardest parts of data analysis. We know that streaming data is data that is emitted at high volume […] The post Kafka to MongoDB: Building a Streamlined Data Pipeline appeared first on Analytics Vidhya.
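The pipeline the article describes boils down to a consume-transform-insert loop. A minimal sketch of the transform step, assuming JSON payloads on the topic (the topic, connection strings, and field names below are illustrative assumptions, not details from the post):

```python
import json
from datetime import datetime, timezone

def to_document(raw: bytes) -> dict:
    """Turn one Kafka message payload (JSON bytes) into a MongoDB document.
    Adds an ingestion timestamp; the field name is an illustrative choice."""
    doc = json.loads(raw)
    doc["ingested_at"] = datetime.now(timezone.utc).isoformat()
    return doc

# The surrounding pipeline (hypothetical topic and collection names) would be:
#
#   from kafka import KafkaConsumer       # kafka-python
#   from pymongo import MongoClient
#
#   consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092")
#   collection = MongoClient("mongodb://localhost:27017")["demo"]["events"]
#   for message in consumer:
#       collection.insert_one(to_document(message.value))
```

Keeping the transform pure like this makes it easy to unit-test without a running broker or database.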


Data Warehousing Essentials: A Guide To Data Warehousing

Seattle Data Guy

Photo by Tiger Lily Data warehouses and data lakes play a crucial role for many businesses. They give businesses access to data from all of their various systems, often integrating that data so that end users can answer business-critical questions. But if we take a step back and only focus on the… Read more The post Data Warehousing Essentials: A Guide To Data Warehousing appeared first on Seattle Data Guy.


Apache Airflow® 101 Essential Tips for Beginners

Apache Airflow® is the open-source standard to manage workflows as code. It is a versatile tool used in companies across the world from agile startups to tech giants to flagship enterprises across all industries. Due to its widespread adoption, Airflow knowledge is paramount to success in the field of data engineering.


Collection of Free Courses to Learn Data Science, Data Engineering, Machine Learning, MLOps, and LLMOps

KDnuggets

Begin your data professional journey from the basics of statistics to building a production-grade AI application.


Tackling Real Time Streaming Data With SQL Using RisingWave

Data Engineering Podcast

Summary Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.


OLMo is Here, Powered by Mosaic AI + Databricks

databricks

As Chief Scientist (Neural Networks) at Databricks, I lead our research team toward the goal of giving everyone the ability to build and.


Introducing Apache Kafka 3.7

Confluent

Apache Kafka 3.7 introduces updates to the Consumer rebalance protocol, an official Apache Kafka Docker image, JBOD support in KRaft-based clusters, and more!


DotSlash: Simplified executable deployment

Engineering at Meta

We’ve open sourced DotSlash, a tool that makes large executables available in source control with a negligible impact on repository size, thus avoiding I/O-heavy clone operations. With DotSlash, a set of platform-specific executables is replaced with a single script containing descriptors for the supported platforms. DotSlash transparently handles fetching, decompressing, and verifying the appropriate remote artifact for the current operating system and CPU.
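As a rough illustration of that mechanism (the field names and values below are approximations for illustration, not necessarily Meta's exact schema), the file committed in place of the binary is a tiny script-like text document that maps each platform to a descriptor of the artifact to fetch and verify:

```
#!/usr/bin/env dotslash
{
  "name": "mytool",
  "platforms": {
    "linux-x86_64": {
      "size": 48104,
      "hash": "blake3",
      "digest": "<hex digest of the compressed artifact>",
      "format": "tar.gz",
      "path": "mytool",
      "providers": [{ "url": "https://example.com/mytool-linux.tar.gz" }]
    }
  }
}
```

Running the file through the `dotslash` interpreter picks the descriptor matching the current OS and CPU, fetches and caches the artifact, checks its digest, and executes it.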


Apache Airflow® Best Practices: DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!


5 FREE Courses on AI and ChatGPT to Take You From 0-100

KDnuggets

Want to learn more about AI and ChatGPT in 2024 for FREE? Keep reading.


Access Over 181,000 USGS Historical Topographic Maps

ArcGIS

We recently updated our online USGS historical topographic map collection with over 1,745 new maps for a new total of over 181,000 maps.


Find Out About The Technology Behind The Latest PFAD In Analytical Database Development

Data Engineering Podcast

Summary Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process.


Announcing Public Preview of Delta Sharing with Cloudflare R2 Integration

databricks

Special thanks to Phillip Jones, Senior Product Manager, and Harshal Brahmbhatt, Systems Engineer from Cloudflare for their contributions to this blog. Organizations across.


Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.


Introducing DoorDash’s In-House Search Engine

DoorDash Engineering

We reviewed the architecture of our global search at DoorDash in early 2022 and concluded that our rapid growth meant within three years we wouldn’t be able to scale the system efficiently, particularly as global search shifted from store-only to a hybrid item-and-store search experience. Our analysis identified Elasticsearch as our architecture’s primary bottleneck.


Simple Precision Time Protocol at Meta

Engineering at Meta

While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol (Simple Precision Time Protocol, or SPTP) that can offer the same level of clock synchronization as unicast PTPv2, more reliably and with fewer resources. In our own tests, SPTP boasts comparable performance to PTP, but with significant improvements in CPU, memory, and network utilization.


Vector Database for LLMs, Generative AI, and Deep Learning

KDnuggets

Exploring the limitless possibilities of AI and making it context-aware.


Anatomy of a Structured Streaming job

Waitingforcode

Apache Spark Structured Streaming relies on the micro-batch pattern, which evaluates the same query in each execution. That's only a high-level vision, though. Under the hood, there are many other interesting things that happen.


15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?


Data Sharing Across Business And Platform Boundaries

Data Engineering Podcast

Summary Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building


Alternatives to SSIS (SQL Server Integration Services) – How To Migrate Away From SSIS

Seattle Data Guy

SQL Server Integration Services (SSIS) comes with a lot of functionality useful for extracting, transforming, and loading data. It can also play important roles in application development and other projects. But SSIS is far from the only platform that can provide these services. You might seek alternatives to SSIS because you want a more agile… Read more The post Alternatives to SSIS (SQL Server Integration Services) – How To Migrate Away From SSIS appeared first on Seattle Data Guy.


Data News — Week 24.08

Christophe Blefari

My ideas these days ( credits ) Hey, fresh Data News edition. This week I participated in a round table about data and gave a cool presentation about engines. The idea was to depict the history of engines over the last 40 years and what led to Polars and DuckDB. Obviously I forgot a few things and I'll do a more complete v2 soon. This is my third presentation about DuckDB in the last 3 months and I think I'll slow down a bit until I have new crazy things to share.


Data Engineering Best Practices - #2. Metadata & Logging

Start Data Engineering

1. Introduction
2. Setup & logging architecture
3. Data pipeline logging best practices
3.1. Metadata: information about pipeline runs & data flowing through your pipeline
3.2. Obtain visibility into the code’s execution sequence using text logs
3.3. Understand resource usage by tracking metrics
3.4. Monitoring UI & traceability
3.5.
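The metadata practice above (stamping every log record with information about the pipeline run) can be sketched with the standard library alone; the pipeline name, field names, and metric values here are illustrative, not the article's code:

```python
import json
import logging
import uuid

def get_pipeline_logger(pipeline: str, run_id: str) -> logging.LoggerAdapter:
    """Return a logger that stamps every record with pipeline-run metadata."""
    logger = logging.getLogger(pipeline)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(name)s run=%(run_id)s %(message)s"))
        logger.addHandler(handler)
    # LoggerAdapter injects run_id into every record, so the formatter can use it.
    return logging.LoggerAdapter(logger, {"run_id": run_id})

run_id = uuid.uuid4().hex  # one id per pipeline run, for traceability
log = get_pipeline_logger("daily_sales", run_id)
log.info("extracted %d rows", 1042)                                  # text log
log.info("metrics %s", json.dumps({"rows": 1042, "seconds": 3.2}))   # metrics
```

Because every line carries the same `run_id`, logs from one run can be correlated across tasks and machines.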


Prepare Now: 2025s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.


Free Data Analyst Bootcamp for Beginners

KDnuggets

Want to become a data analyst? This free beginner-friendly data analyst bootcamp is all you need.


Min rate limits for Apache Kafka

Waitingforcode

I bet you know it already. You can limit the max throughput of Apache Spark Structured Streaming jobs for popular data sources such as Apache Kafka, Delta Lake, or raw files. But did you know that you can also control the lower limit, at least for Apache Kafka?
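In Spark's Kafka source both limits are plain reader options. A minimal sketch of the option map (broker address and topic name are illustrative; `minOffsetsPerTrigger` and its companion `maxTriggerDelay` require Spark 3.3+):

```python
# Options for spark.readStream.format("kafka"); values are illustrative.
kafka_options = {
    "kafka.bootstrap.servers": "localhost:9092",
    "subscribe": "events",
    "maxOffsetsPerTrigger": "10000",  # upper bound on offsets read per micro-batch
    "minOffsetsPerTrigger": "1000",   # lower bound: wait until this many offsets accumulate
    "maxTriggerDelay": "15m",         # but fire anyway after this delay, even below the minimum
}

# With a real SparkSession this would plug in as (not executed here):
#   df = spark.readStream.format("kafka").options(**kafka_options).load()
```

The min/max pair lets a job trade latency for batch efficiency: small trickles are held back until a worthwhile batch forms, while `maxTriggerDelay` caps how long the hold-back can last.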


Performance Improvements for Stateful Pipelines in Apache Spark Structured Streaming

databricks

Introduction Apache Spark™ Structured Streaming is a popular open-source stream processing platform that provides scalability and fault tolerance, built on top of the S.


Why Your Team Needs To Implement Data Quality For Your AI Strategy

Seattle Data Guy

Companies ranging from start-ups to enterprises are looking to implement AI and ML into their data strategy. With that, it’s important not to forget about data quality. Regardless of how fancy or sophisticated a company’s AI model might be, poor data quality will break it and make the outputs of these models useless… Read more The post Why Your Team Needs To Implement Data Quality For Your AI Strategy appeared first on Seattle Data Guy.


How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.


DuckCon #4 takeaways

Christophe Blefari

A picture of people chatting at DuckCon ( credits ) Hey, this is a straightforward post about the ideas and takeaways I got from DuckCon. I guess the recording will be posted online in a few days or weeks. It took place in Amsterdam in a wonderful location. The afternoon's agenda was quite small (because it is still a small conference) but interesting.


Welcome Noteable: Making Data Streaming Easier and More Approachable

Confluent

Confluent has hired many Noteable employees to help make application development easier for both Kafka and Flink developers.


7 Free Harvard University Courses to Advance Your Skills

KDnuggets

Transform your tech career with one of the best universities in the world!