February, 2024

article thumbnail

Using Trino And Iceberg As The Foundation Of Your Data Lakehouse

Data Engineering Podcast

Summary A data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offer the ease of use and execution speed of data warehouses with the infinite storage and sc

Data Lake 262
article thumbnail

Happy Leap Day!

The Pragmatic Engineer

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover topics related to Big Tech and startups through the lens of engineering managers and senior engineers. In this article, we cover one out of three topics from today’s subscriber-only The Pulse issue. Subscribe to get issues like this in your inbox, every week.

article thumbnail

Kafka to MongoDB: Building a Streamlined Data Pipeline

Analytics Vidhya

Introduction Data is fuel for the IT industry and the Data Science Project in today’s online world. IT industries rely heavily on real-time insights derived from streaming data sources. Handling and processing the streaming data is the hardest work for Data Analysis. We know that streaming data is data that is emitted at high volume […] The post Kafka to MongoDB: Building a Streamlined Data Pipeline appeared first on Analytics Vidhya.

MongoDB 217
article thumbnail

Data Warehousing Essentials: A Guide To Data Warehousing

Seattle Data Guy

Photo by Tiger Lily Data warehouses and data lakes play a crucial role for many businesses. It gives businesses access to the data from all of their various systems. As well as often integrating data so that end-users can answer business critical questions. But if we take a step back and only focus on the… Read more The post Data Warehousing Essentials: A Guide To Data Warehousing appeared first on Seattle Data Guy.

Data Lake 162
article thumbnail

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

article thumbnail

Collection of Free Courses to Learn Data Science, Data Engineering, Machine Learning, MLOps, and LLMOps

KDnuggets

Begin your data professional journey from the basics of statistics to building a production-grade AI application.

article thumbnail

ArcGIS Pro 3.3 Moves to.NET 8

ArcGIS

ArcGIS Pro 3.3 is planned to be available in May 2024. Install.NET 8 before attempting to install ArcGIS Pro 3.3 for the best user experience!

143
143

More Trending

article thumbnail

OLMo is Here, Powered by Mosaic AI + Databricks

databricks

As Chief Scientist (Neural Networks) at Databricks, I lead our research team toward the goal of giving everyone the ability to build and.

Building 143
article thumbnail

Introducing Apache Kafka 3.7

Confluent

Apache Kafka 3.7 introduces updates to the Consumer rebalance protocol, an official Apache Kafka Docker image, JBOD support in Kraft-based clusters, and more!

Kafka 140
article thumbnail

DotSlash: Simplified executable deployment

Engineering at Meta

We’ve open sourced DotSlash , a tool that makes large executables available in source control with a negligible impact on repository size, thus avoiding I/O-heavy clone operations. With DotSlash, a set of platform-specific executables is replaced with a single script containing descriptors for the supported platforms. DotSlash handles transparently fetching, decompressing, and verifying the appropriate remote artifact for the current operating system and CPU.

Metadata 135
article thumbnail

5 FREE Courses on AI and ChatGPT to Take You From 0-100

KDnuggets

Want to learn more about AI and ChatGPT in 2024 for FREE? Keep reading.

159
159
article thumbnail

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

article thumbnail

Access Over 181,000 USGS Historical Topographic Maps

ArcGIS

We recently updated our online USGS historical topographic map collection with over 1,745 new maps for a new total of over 181,000 maps.

article thumbnail

Find Out About The Technology Behind The Latest PFAD In Analytical Database Development

Data Engineering Podcast

Summary Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process.

Database 162
article thumbnail

Announcing Public Preview of Delta Sharing with Cloudflare R2 Integration

databricks

Special thanks to Phillip Jones, Senior Product Manager, and Harshal Brahmbhatt, Systems Engineer from Cloudflare for their contributions to this blog. Organizations across.

article thumbnail

Introducing DoorDash’s In-House Search Engine

DoorDash Engineering

We reviewed the architecture of our global search at DoorDash in early 2022 and concluded that our rapid growth meant within three years we wouldn’t be able to scale the system efficiently, particularly as global search shifted from store-only to a hybrid item-and-store search experience. Our analysis identified Elasticsearch as our architecture’s primary bottleneck.

article thumbnail

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

article thumbnail

Anatomy of a Structured Streaming job

Waitingforcode

Apache Spark Structured Streaming relies on the micro-batch pattern which evaluates the same query in each execution. That's only a high level vision, though. Under-the-hood, there are many other interesting things that happen.

130
130
article thumbnail

Free Data Analyst Bootcamp for Beginners

KDnuggets

Want to become a data analyst? This free beginner-friendly data analyst bootcamp is all you need.

Data 158
article thumbnail

Alternatives to SSIS(SQL Server Integration Services) – How To Migrate Away From SSIS

Seattle Data Guy

SQL Server Integration Services (SSIS) comes with a lot of functionality useful for extracting, transforming, and loading data. It can also play important roles in application development and other projects. But SSIS is far from the only platform that can provide these services. You might seek alternatives to SSIS because you want a more agile… Read more The post Alternatives to SSIS(SQL Server Integration Services) – How To Migrate Away From SSIS appeared first on Seattle Data Guy.

SQL 130
article thumbnail

Data Sharing Across Business And Platform Boundaries

Data Engineering Podcast

Summary Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building

Data Lake 147
article thumbnail

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

article thumbnail

Data News — Week 24.08

Christophe Blefari

My ideas these days ( credits ) Hey, fresh Data News edition. This week I've participated to a round table about data and did a cool presentation about Engines. The idea was to depict the history of engines over the last 40 years and what leads to polars and DuckDB. Obviously the I forgot a few things and I'll do a more complete v2 soon. This is my third presentation about DuckDB in the last 3 months and I think I'll slow down a bit until I get new crazy things to share.

Data Lake 130
article thumbnail

Data Engineering Best Practices - #2. Metadata & Logging

Start Data Engineering

1. Introduction 2. Setup & Logging architecture 3. Data Pipeline Logging Best Practices 3.1. Metadata: Information about pipeline runs, & data flowing through your pipeline 3.2. Obtain visibility into the code’s execution sequence using text logs 3.3. Understand resource usage by tracking Metrics 3.4. Monitoring UI & Traceability 3.5.

Metadata 130
article thumbnail

Simple Precision Time Protocol at Meta

Engineering at Meta

While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol (Simple Precision Time Protocol – SPTP), that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources. In our own tests, SPTP boasts comparable performance to PTP, but with significant improvements in CPU, memory, and network utilization.

Utilities 130
article thumbnail

Vector Database for LLMs, Generative AI, and Deep Learning

KDnuggets

Exploring the limitless possibilities of AI and making it context-aware.

article thumbnail

Prepare Now: 2025s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

article thumbnail

Min rate limits for Apache Kafka

Waitingforcode

I bet you know it already. You can limit the max throughput for Apache Spark Structured Streaming jobs for popular data sources such as Apache Kafka, Delta Lake, or raw files. Have you known that you can also control the lower limit, at least for Apache Kafka?

Kafka 130
article thumbnail

Why Your Team Needs To Implement Data Quality For Your AI Strategy

Seattle Data Guy

Companies that range from start-ups to enterprises are looking to implement AI and ML into their data strategy. With that it’s important not to forget about data quality. Regardless of how fancy or sophisticated a company’s AI model might be, poor data quality will break it. It will make the outputs of these models useless… Read more The post Why Your Team Needs To Implement Data Quality For Your AI Strategy appeared first on Seattle Data Guy.

Data 130
article thumbnail

Data News — Week 24.05

Christophe Blefari

hey ( credits ) Hello here, this is Christophe from Amsterdam. I hope you're doing good. I'm in Amsterdam for the day for the DuckCon #4. The DuckDB annual conference, and god I like Europe. Being able to travel by train from Berlin to Paris to Amsterdam while going to the west of France for a lecture in a week is something truly awesome. Anyway this week will be a mixed Data News with links, stuff and ideas and a small wrap-up of the DuckCon + the stuff I presented on Wed. to a Modern

MongoDB 130
article thumbnail

Performance Improvements for Stateful Pipelines in Apache Spark Structured Streaming

databricks

Introduction Apache Spark™ Structured Streaming is a popular open-source stream processing platform that provides scalability and fault tolerance, built on top of the S.

Process 128
article thumbnail

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

article thumbnail

Welcome Noteable: Making Data Streaming Easier and More Approachable

Confluent

Confluent has hired many Noteable employees to help make application development easier for both Kafka and Flink developers.

Kafka 127
article thumbnail

Large Language Models Explained in 3 Levels of Difficulty

KDnuggets

Simple explanations, no matter what your level is right now.

156
156
article thumbnail

What's new on the cloud for data engineers - part 12 (10.2023-02.2024)

Waitingforcode

It's time for another part of "What's new on the cloud for data engineers" Let's see what happened in the last 5 months.

article thumbnail

One map to rule them all

ArcGIS

All we have to decide is what to map with the time that is given us.

122
122
article thumbnail

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.