February, 2024

article thumbnail

Using Trino And Iceberg As The Foundation Of Your Data Lakehouse

Data Engineering Podcast

Summary A data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offer the ease of use and execution speed of data warehouses with the infinite storage and sc

Data Lake 262
article thumbnail

Happy Leap Day!

The Pragmatic Engineer

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover topics related to Big Tech and startups through the lens of engineering managers and senior engineers. In this article, we cover one out of three topics from today’s subscriber-only The Pulse issue. Subscribe to get issues like this in your inbox, every week.

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

article thumbnail

Kafka to MongoDB: Building a Streamlined Data Pipeline

Analytics Vidhya

Introduction Data is fuel for the IT industry and the Data Science Project in today’s online world. IT industries rely heavily on real-time insights derived from streaming data sources. Handling and processing the streaming data is the hardest work for Data Analysis. We know that streaming data is data that is emitted at high volume […] The post Kafka to MongoDB: Building a Streamlined Data Pipeline appeared first on Analytics Vidhya.

MongoDB 217
article thumbnail

Data Warehousing Essentials: A Guide To Data Warehousing

Seattle Data Guy

Photo by Tiger Lily Data warehouses and data lakes play a crucial role for many businesses. It gives businesses access to the data from all of their various systems. As well as often integrating data so that end-users can answer business critical questions. But if we take a step back and only focus on the… Read more The post Data Warehousing Essentials: A Guide To Data Warehousing appeared first on Seattle Data Guy.

Data Lake 162
article thumbnail

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

article thumbnail

Anatomy of a Structured Streaming job

Waitingforcode

Apache Spark Structured Streaming relies on the micro-batch pattern which evaluates the same query in each execution. That's only a high level vision, though. Under-the-hood, there are many other interesting things that happen.

130
130
article thumbnail

Data News — Week 24.08

Christophe Blefari

My ideas these days ( credits ) Hey, fresh Data News edition. This week I've participated to a round table about data and did a cool presentation about Engines. The idea was to depict the history of engines over the last 40 years and what leads to polars and DuckDB. Obviously the I forgot a few things and I'll do a more complete v2 soon. This is my third presentation about DuckDB in the last 3 months and I think I'll slow down a bit until I get new crazy things to share.

Data Lake 130

More Trending

article thumbnail

Data Engineering Best Practices - #2. Metadata & Logging

Start Data Engineering

1. Introduction 2. Setup & Logging architecture 3. Data Pipeline Logging Best Practices 3.1. Metadata: Information about pipeline runs, & data flowing through your pipeline 3.2. Obtain visibility into the code’s execution sequence using text logs 3.3. Understand resource usage by tracking Metrics 3.4. Monitoring UI & Traceability 3.5.

Metadata 130
article thumbnail

Semantic Layers are the Missing Piece for AI-Enabled Analytics

KDnuggets

Integrating a semantic layer with Language Learning Models (LLMs) presents a clean solution to this, particularly in the realm of AI chatbots. This combination empowers businesses to generate fast responses and reports based on their data. Leveraging AI and semantic layers is advancing business intelligence, making it easier than ever for people to interact with data.

article thumbnail

Alternatives to SSIS(SQL Server Integration Services) – How To Migrate Away From SSIS

Seattle Data Guy

SQL Server Integration Services (SSIS) comes with a lot of functionality useful for extracting, transforming, and loading data. It can also play important roles in application development and other projects. But SSIS is far from the only platform that can provide these services. You might seek alternatives to SSIS because you want a more agile… Read more The post Alternatives to SSIS(SQL Server Integration Services) – How To Migrate Away From SSIS appeared first on Seattle Data Guy.

SQL 130
article thumbnail

Introducing Apache Kafka 3.7

Confluent

Apache Kafka 3.7 introduces updates to the Consumer rebalance protocol, an official Apache Kafka Docker image, JBOD support in Kraft-based clusters, and more!

Kafka 140
article thumbnail

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

article thumbnail

Introducing DoorDash’s In-House Search Engine

DoorDash Engineering

We reviewed the architecture of our global search at DoorDash in early 2022 and concluded that our rapid growth meant within three years we wouldn’t be able to scale the system efficiently, particularly as global search shifted from store-only to a hybrid item-and-store search experience. Our analysis identified Elasticsearch as our architecture’s primary bottleneck.

article thumbnail

Find Out About The Technology Behind The Latest PFAD In Analytical Database Development

Data Engineering Podcast

Summary Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process.

Database 162
article thumbnail

ArcGIS Pro 3.3 Moves to.NET 8

ArcGIS

ArcGIS Pro 3.3 is planned to be available in May 2024. Install.NET 8 before attempting to install ArcGIS Pro 3.3 for the best user experience!

143
143
article thumbnail

University of Cincinnati MS Business Analytics Summer 2024 Information Session

KDnuggets

Don't miss this chance to chart your course toward a successful career in business analytics. Reserve your spot now and embark on a journey of knowledge and growth!

136
136
article thumbnail

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

article thumbnail

Robinhood Money Drills Kicks Off 2024 With Three New Universities

Robinhood

Florida State University, Coastal Carolina University, and the University of California, Berkeley will introduce financial education coursework with support from Robinhood Money Drills Robinhood Markets, Inc. is launching Robinhood Money Drills with three new universities, including Florida State University, Coastal Carolina University, and the University of California, Berkeley.

Education 120
article thumbnail

Announcing Public Preview of Delta Sharing with Cloudflare R2 Integration

databricks

Special thanks to Phillip Jones, Senior Product Manager, and Harshal Brahmbhatt, Systems Engineer from Cloudflare for their contributions to this blog. Organizations across.

article thumbnail

New with Confluent Platform: Seamless Migration Off ZooKeeper, Arm64 Support, and More

Confluent

Confluent Platform 7.6 brings upgrading for existing clusters from ZooKeeper to KRaft, compaction support for Tiered Storage, OAuth (early access), improvements to the Oracle CDC premium connector, and more.

article thumbnail

Data Sharing Across Business And Platform Boundaries

Data Engineering Podcast

Summary Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building

Data Lake 147
article thumbnail

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

article thumbnail

Simple Precision Time Protocol at Meta

Engineering at Meta

While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol (Simple Precision Time Protocol – SPTP), that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources. In our own tests, SPTP boasts comparable performance to PTP, but with significant improvements in CPU, memory, and network utilization.

Utilities 122
article thumbnail

Top 5 AI Coding Assistants You Must Try

KDnuggets

Discover the top AI coding assistants that can 10X your productivity overnight - #5 has the best autocomplete feature, and #1 is the most advanced code assistant tool ever seen!

Coding 132
article thumbnail

Access Over 181,000 USGS Historical Topographic Maps

ArcGIS

We recently updated our online USGS historical topographic map collection with over 1,745 new maps for a new total of over 181,000 maps.

article thumbnail

OLMo is Here, Powered by Mosaic AI + Databricks

databricks

As Chief Scientist (Neural Networks) at Databricks, I lead our research team toward the goal of giving everyone the ability to build and.

Building 137
article thumbnail

Prepare Now: 2025s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, and Terrence Sheflin

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

article thumbnail

IoT Data Streaming for Building Private Wireless Networks

Confluent

Confluent enables real-time, reliable, scalable, and secure communication between IoT devices, applications, and backend systems. Streamline data processing and unlock analytics to boost productivity and time to market while lowering infrastructure costs.

Building 119
article thumbnail

5 Steps to Data Diversity: More Diverse Data Makes for Smarter AI

Snowflake

In an iconic Top Gun scene , Charlie tells Maverick that a maneuver is impossible. Maverick replies, “The data on the MIG is inaccurate.” In the more recent sequel, despite his extensive, firsthand knowledge, Maverick is told “ the future’s coming and you’re not in it. ” While flying may be more automated now, the importance of accurate and diverse data for aviation safety remains — and is likely even more critical.

article thumbnail

DotSlash: Simplified executable deployment

Engineering at Meta

We’ve open sourced DotSlash , a tool that makes large executables available in source control with a negligible impact on repository size, thus avoiding I/O-heavy clone operations. With DotSlash, a set of platform-specific executables is replaced with a single script containing descriptors for the supported platforms. DotSlash handles transparently fetching, decompressing, and verifying the appropriate remote artifact for the current operating system and CPU.

Metadata 121
article thumbnail

26 Data Science Interview Questions You Should Know

KDnuggets

Learn about the most common questions asked during data science interviews. This blog covers non-technical, Python, SQL, statistics, data analysis, and machine learning questions.

article thumbnail

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

article thumbnail

Health Care Outside of the Box

Cloudera

How enterprise-grade data management creates better and more efficient care. In the last few years, the acceptance of telehealth has become more widespread as patients and providers found they could maintain continuity through phone and video collaboration, instead of in-person visits. In many cases, a level of care that once required a drive to the clinic or hospital could be delivered over a mobile phone or laptop, with no travel and no waiting room.

Medical 102
article thumbnail

Performance Improvements for Stateful Pipelines in Apache Spark Structured Streaming

databricks

Introduction Apache Spark™ Structured Streaming is a popular open-source stream processing platform that provides scalability and fault tolerance, built on top of the S.

Process 110
article thumbnail

Welcome Noteable: Making Data Streaming Easier and More Approachable

Confluent

Confluent has hired many Noteable employees to help make application development easier for both Kafka and Flink developers.

Kafka 127
article thumbnail

Top 5 Data + AI Predictions for Financial Services in 2024

Snowflake

Generative AI tops every list of major financial services trends for 2024. And it’s no wonder — this new technology has the potential to revolutionize the industry by augmenting the value of employee work, driving organizational efficiencies, providing personalized customer experiences, and uncovering new insights from vast amounts of data. Its predictive capabilities can help leaders anticipate market trends and make more informed decisions, improving financial outcomes for customers as well as

article thumbnail

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.