April 2025

How To Set Up Your Data Infrastructure In 2025 – Part 1

Seattle Data Guy

Planning out your data infrastructure in 2025 can feel wildly different than it did even five years ago. The ecosystem is louder, flashier, and more fragmented. Everyone is talking about AI, chatbots, LLMs, vector databases, and whether your data stack is “AI-ready.” Vendors promise magic: just plug in their tool and watch your insights appear. … Read more The post How To Set Up Your Data Infrastructure In 2025 – Part 1 appeared first on Seattle Data Guy.

Cloudflare R2 Storage with Apache Iceberg

Confessions of a Data Guy

Rethinking Object Storage: A First Look at Cloudflare R2 and Its Built-In Apache Iceberg Catalog. Sometimes, we follow tradition because, well, it works, until something new comes along and makes us question the status quo. For many of us, Amazon S3 is that well-trodden path: the backbone of our data platforms and pipelines, used countless times each day. If […] The post Cloudflare R2 Storage with Apache Iceberg appeared first on Confessions of a Data Guy.

Introducing the dbt MCP Server – Bringing Structured Data to AI Workflows and Agents

dbt Developer Hub

dbt is the standard for creating governed, trustworthy datasets on top of your structured data. MCP is showing increasing promise as the standard for providing context to LLMs, allowing them to function at a high level in real-world, operational scenarios. Today, we are open sourcing an experimental version of the dbt MCP server. We expect that over the coming years, structured data is going to become heavily integrated into AI workflows and that dbt will play a key role in building and provision

How Apache Iceberg Is Changing the Face of Data Lakes

Snowflake

Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew. The data warehouse solved for performance and scale but, much like the databases that preceded it, relied on proprietary formats to build vertically integrated systems.

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

How Meta understands data at scale

Engineering at Meta

Managing and understanding large-scale data ecosystems is a significant challenge for many organizations, requiring innovative solutions to efficiently safeguard user data. Meta’s vast and diverse systems make it particularly challenging to comprehend its structure, meaning, and context at scale. To address these challenges, we made substantial investments in advanced data understanding technologies, as part of our Privacy Aware Infrastructure (PAI).

How Netflix Accurately Attributes eBPF Flow Logs

Netflix Tech

By Cheng Xie, Bryan Shultz, and Christine Xu. In a previous blog post, we described how Netflix uses eBPF to capture TCP flow logs at scale for enhanced network insights. In this post, we delve deeper into how Netflix solved a core problem: accurately attributing flow IP addresses to workload identities. A Brief Recap: FlowExporter is a sidecar that runs alongside all Netflix workloads.
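The attribution problem the post describes, mapping a flow's IP address at a given timestamp back to the workload that held that IP at that moment, can be illustrated with a minimal sketch. This is not Netflix's implementation; the `Lease` model, names, and data below are hypothetical, showing only why a timestamp is required when IPs are reused:

```python
import bisect
from dataclasses import dataclass

@dataclass
class Lease:
    """A workload's ownership of an IP over a time window (hypothetical model)."""
    ip: str
    start: int      # epoch seconds when the workload acquired the IP
    end: int        # epoch seconds when it released the IP
    workload: str

def attribute(leases, ip, ts):
    """Return the workload that owned `ip` at time `ts`, or None.

    Leases per IP are kept sorted by start time, so a binary search
    finds the candidate window in O(log n).
    """
    windows = leases.get(ip, [])
    starts = [w.start for w in windows]
    i = bisect.bisect_right(starts, ts) - 1  # last lease starting at or before ts
    if i >= 0 and windows[i].start <= ts <= windows[i].end:
        return windows[i].workload
    return None

# IPs are reused over time, so attributing by IP alone is ambiguous.
leases = {
    "10.0.0.5": [
        Lease("10.0.0.5", 100, 200, "api-server"),
        Lease("10.0.0.5", 300, 400, "batch-job"),  # same IP, later workload
    ]
}
```

With this data, `attribute(leases, "10.0.0.5", 150)` resolves to `"api-server"`, while the same IP at time 350 resolves to `"batch-job"`.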

Data Engineering Weekly #218

Data Engineering Weekly

Try Apache Airflow® 3 on Astro: Airflow 3 is here and has never been easier to use or more secure. Spin up a new 3.0 deployment on Astro to test DAG versioning, backfills, event-driven scheduling, and more. Get started → Chip Huyen: Exploring three strategies - functional correctness, AI-as-a-judge, and comparative evaluation. As AI development becomes mainstream, so does the need to adopt all the best practices in software engineering.

Spotter: Your AI Analyst

ThoughtSpot

Loved by Business Leaders, Trusted by Analysts. Last year, we introduced Spotter, our AI analyst that delivers agentic data experiences with enterprise-grade trust and scale. Today, we're delivering several key innovations that will help you streamline insights-to-actions with agentic analytics, crossing a major milestone on our path to enabling an autonomous business.

Snowflake Startup Challenge 2025: Meet the Top 10

Snowflake

The traditional five-year anniversary gift is wood. Since snowboards often have a wooden core, and because a snowboard is the traditional trophy for the Snowflake Startup Challenge, we're going to go ahead and say that the snowboard trophy qualifies as a present for the fifth anniversary of our Startup Challenge. The only difference is that instead of receiving the gift, we'll be giving it to one of the 10 semifinalists listed below!

Meta Open Source: 2024 by the numbers

Engineering at Meta

Open source has played an essential role in the tech industry and beyond. Whether in the AI/ML, web, or mobile space, our open source community grew and evolved while connecting people worldwide. At Meta Open Source, 2024 was a year of growth and transformation. Our open source initiatives addressed the evolving needs and challenges of developers, powering breakthroughs in AI and enabling the creation of innovative, user-focused applications and experiences.

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

How to leverage business intelligence in retail industry

InData Labs

The retail sector is among the most competitive markets, making it exceptionally difficult for businesses not only to thrive but even to survive. Business intelligence in the retail industry can be a colossal game changer for organizations struggling to compete. BI for retail allows companies to leverage big data analytics and machine learning techniques to extract valuable.

What Is BigQuery And How Do You Load Data Into It?

Seattle Data Guy

If you work in data, then you've likely used BigQuery, and you've likely used it without really thinking about how it operates under the hood. On the surface, BigQuery is Google Cloud's fully managed, serverless data warehouse. It's the Redshift of GCP, except we like it a little more. The question becomes, how does it work?… Read more The post What Is BigQuery And How Do You Load Data Into It?

Data Appending vs. Data Enrichment: How to Maximize Data Quality and Insights

Precisely

A former colleague recently asked me to explain my role at Precisely. After my (admittedly lengthy) explanation of what I do as the EVP and GM of our Enrich business, she summarized it in a very succinct, but new way: “Oh, you manage the appending datasets.” That got me thinking. We often use different terms when we're talking about the same thing; in this case, data appending vs. data enrichment.

Microsoft Fabric vs. Snowflake: Key Differences You Need to Know

Edureka

Selecting the appropriate data platform becomes crucial as businesses depend more and more on data to inform their decisions. Although they take quite different approaches, Microsoft Fabric and Snowflake, two of the top players in the current data landscape, both provide strong capabilities. Understanding how these platforms compare can assist you in selecting the best option for your company, regardless of your role as a data engineer, business analyst, or decision-maker.

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Snowflake Startup Spotlight: Innova-Q

Snowflake

Welcome to Snowflake's Startup Spotlight, where we learn about amazing companies building businesses on Snowflake. This time, we're casting the spotlight on Innova-Q, where the founders are stirring things up in the food and beverage industry. With the power of modern generative AI, they're improving product safety, streamlining operations and simplifying regulatory compliance.

Data quality on Databricks - Spark Expectations

Waitingforcode

Previously we learned how to control data quality with Delta Live Tables. Now, it's time to see an open source library in action, Spark Expectations.

Why Data Quality Isn’t Worth The Effort: Data Quality Coffee With Uncle Chip

DataKitchen

Data quality has become one of the most discussed challenges in modern data teams, yet it remains one of the most thankless and frustrating responsibilities. In the first of the Data Quality Coffee With Uncle Chip series, he highlights the persistent tension between the need for clean, reliable data and its overwhelming complexity.

Handling Network Throttling with AWS EC2 at Pinterest

Pinterest Engineering

Jia Zhan, Senior Staff Software Engineer, Pinterest; Sachin Holla, Principal Solution Architect, AWS. Summary: Pinterest is a visual search engine and powers over 550 million monthly active users globally. Pinterest's infrastructure runs on AWS and leverages Amazon EC2 instances for its compute fleet. In recent years, while managing Pinterest's EC2 infrastructure, particularly for our essential online storage systems, we identified a significant challenge: the lack of clear insights into EC2's network

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You'll learn how to:
- Understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to
- Write DAGs that adapt to your data at runtime and set up alerts and notifications
- Scale you
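Airflow specifics aside, the core idea a DAG encodes, that a task may only run after everything it depends on has finished, is just a topological ordering. A minimal sketch with Python's standard library (the pipeline names are hypothetical; this is not Airflow code):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: extract feeds a transform and a data-quality
# check, and both must finish before the load runs.
dag = {
    "transform": {"extract"},          # transform depends on extract
    "quality_check": {"extract"},
    "load": {"transform", "quality_check"},
}

# static_order() yields tasks in an order that respects every dependency,
# which is exactly the guarantee a scheduler needs before running tasks.
order = list(TopologicalSorter(dag).static_order())
```

Any valid order starts with `extract` and ends with `load`; a real orchestrator like Airflow additionally runs independent tasks (here, `transform` and `quality_check`) in parallel.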

Snowflake Data Quality Framework: Validate, Monitor, and Trust Your Data

Cloudyard

In today's cloud-first landscape, the integrity of data pipelines is crucial for operational success, regulatory compliance, and business decision-making. This blog, “Snowflake Data Quality Framework: Validate, Monitor, and Trust Your Data,” will walk you through a Snowflake-native, dynamic, and extensible Data Quality (DQ) Framework capable of automatically validating data pipelines, logging results, and monitoring anomalies in near real-time.
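The validate, log, and monitor loop such a framework automates can be sketched generically. This is not the Snowflake-native framework the post describes, just a plain-Python illustration of the loop, with hypothetical rule names and row fields:

```python
from datetime import datetime, timezone

# Hypothetical rules: each maps a rule name to a row-level predicate.
RULES = {
    "amount_non_negative": lambda row: row["amount"] >= 0,
    "customer_id_present": lambda row: bool(row.get("customer_id")),
}

def run_checks(rows, rules=RULES):
    """Validate every row against every rule and return a result log,
    one entry per rule, suitable for writing to a monitoring table."""
    log = []
    for name, predicate in rules.items():
        failures = [i for i, row in enumerate(rows) if not predicate(row)]
        log.append({
            "rule": name,
            "checked_at": datetime.now(timezone.utc).isoformat(),
            "rows_checked": len(rows),
            "rows_failed": len(failures),
            "failed_row_indexes": failures,
            "passed": not failures,
        })
    return log

# Sample pipeline output: the second row violates both rules.
rows = [
    {"amount": 12.5, "customer_id": "c-001"},
    {"amount": -3.0, "customer_id": ""},
]
results = run_checks(rows)
```

A production framework would persist `results` to a log table and alert when failure counts drift from their baseline, the "monitoring anomalies" half of the loop.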

Top 10 Data Engineering Trends in 2025

Edureka

As we approach 2025, data is more than simply numbers; it serves as the foundation for business decision-making in all sectors. However, data alone is insufficient. To remain competitive in the current digital environment, businesses must effectively gather, handle, and manage it. That is where data engineering comes in. It is the force behind seamless data flow, enabling everything from AI-driven automation to real-time analytics.

Simplifying Multimodal Data Analysis with Snowflake Cortex AI

Snowflake

Snowflake Cortex AI now features native multimodal AI capabilities, eliminating data silos and the need for separate, expensive tools. Introducing Cortex AI COMPLETE Multimodal, now in public preview. This major enhancement brings the power to analyze images and other unstructured data directly into Snowflake's query engine, using familiar SQL at scale.

The Best Data Dictionary Tools in 2025

Monte Carlo

Different teams love using the same data in totally different ways. Eventually, it gets to the point where everyone has their own secret nickname for the same customer field, like Sales calling it cust_id while Marketing goes with user_ref. And yeah… that's kind of a problem. That's where data dictionary tools come in. A data dictionary tool helps define and organize your data so everyone's speaking the same language.
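The cust_id vs. user_ref drift described above reduces to mapping every team-specific alias onto one canonical definition, which is the core of what a data dictionary stores. A toy sketch (the field names and definitions are hypothetical):

```python
# Hypothetical mini data dictionary: canonical field -> definition + known aliases.
DATA_DICTIONARY = {
    "customer_id": {
        "definition": "Stable unique identifier for a customer account.",
        "aliases": {"customer_id", "cust_id", "user_ref"},
    },
}

def canonical_name(field):
    """Resolve a team-specific column name to its canonical field, if known."""
    for name, entry in DATA_DICTIONARY.items():
        if field in entry["aliases"]:
            return name
    return None  # unknown field: a candidate for a new dictionary entry
```

With this in place, both `canonical_name("cust_id")` and `canonical_name("user_ref")` resolve to `"customer_id"`, so downstream reports agree on what they are counting.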

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Data Science Side Quests: 4 Uncommon Projects to Elevate Your Skills

KDnuggets

Doing data science projects can be demanding, but that doesn't mean it has to be boring. Here are four projects to introduce more fun to your learning and stand out from the masses.

Improving Pinterest Search Relevance Using Large Language Models

Pinterest Engineering

Han Wang | Machine Learning Engineer II, Relevance & Query Understanding; Mukuntha Narayanan | Machine Learning Engineer II, Relevance & Query Understanding; Onur Gungor | (former) Staff Machine Learning Engineer, Relevance & Query Understanding; Jinfeng Rao | Senior Staff Machine Learning Engineer, Pinner Discovery Figure: Illustration of the search relevance system at Pinterest.

AI and Data in Production: Insights from Avinash Narasimha [AI Solutions Leader at Koch Industries]

Data Engineering Weekly

In our latest episode of Data Engineering Weekly, co-hosted by Aswin, we explored the practical realities of AI deployment and data readiness with our distinguished guest, Avinash Narasimha, Former AI Solutions Leader at Koch Industries. This discussion shed significant light on the maturity, challenges, and potential that generative AI and data preparedness present in contemporary enterprises.

Platform as a Service (PaaS)

WeCloudData

PaaS is a fundamental cloud computing model that offers developers and organizations a robust environment for building, deploying, and managing applications efficiently. This blog provides detailed information on Platform as a Service (PaaS), how it differs from other cloud computing models, its working principles, and its benefits. Let's get started and explore PaaS with […] The post Platform as a Service (PaaS) appeared first on WeCloudData.

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

The Future of Data Management Is Agentic AI

Snowflake

Managing and utilizing data effectively is crucial for organizational success in today's fast-paced technological landscape. The vast amounts of data generated daily require advanced tools for efficient management and analysis. Enter agentic AI, a type of artificial intelligence set to transform enterprise data management. As the Snowflake CTO at Deloitte, I have seen the powerful impact of these technologies, especially when leveraging the combined experience of the Deloitte and Snowflake allia

How to Extract Data from APIs for Data Pipelines using Python

Start Data Engineering

1. Introduction
2. APIs are a way to communicate between systems on the Internet
   2.1. HTTP is a protocol commonly used for websites
      2.1.1. Request: Ask the Internet exactly what you want
      2.1.2. Response is what you get from the server
3. API data extraction = GET-ting data from a server
   3.1. GET data
      3.1.1. GET data for a specific entity
      3.
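The request/response flow the outline describes can be sketched with Python's standard library. The endpoint and parameters below are hypothetical, and a real pipeline would add authentication, retries, and pagination on top of this:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_url(base, params):
    """Compose a GET URL: the query string encodes exactly what you want."""
    return f"{base}?{urlencode(params)}" if params else base

def get_json(url, timeout=10):
    """Issue the GET request and decode the server's JSON response."""
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req, timeout=timeout) as resp:  # the network round trip
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    # Hypothetical endpoint; swap in a real API and call get_json(url).
    url = build_url("https://api.example.com/v1/orders",
                    {"status": "shipped", "page": 1})
    print(url)
```

Separating URL construction from the request itself keeps the "ask exactly what you want" step testable without touching the network.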

AI Con USA 2025: An Intelligence-Driven Future

KDnuggets

AI Con USA, the premier event for artificial intelligence and machine learning professionals, is set to take place from June 8-13, 2025.

The Power of Fine-Tuning on Your Data: Quick Fixing Bugs with LLMs via Never Ending Learning (NEL)

databricks

Summary: LLMs have revolutionized software development by increasing the productivity of programmers.

Apache Airflow® Best Practices: DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!