Sat.Jul 20, 2024 - Fri.Jul 26, 2024

article thumbnail

How to implement data quality checks with greatexpectations

Start Data Engineering

1. Introduction 2. Project overview 3. Check your data before making it available to end-users; Write-Audit-Publish(WAP) pattern 4. TL;DR: How the greatexpectations library works 4.1. greatexpectations quick setup 5. From an implementation perspective, there are four types of tests 5.1. Running checks on one dataset 5.2. Checks involving the current dataset and its historical data 5.3.

Datasets 208
article thumbnail

PyArrow vs Polars (vs DuckDB) for Data Pipelines.

Confessions of a Data Guy

I’ve had something rattling around in the old noggin for a while; it’s just another strange idea that I can’t quite shake out. We all keep hearing about Arrow this and Arrow that … seems every new tool built today for Data Engineering seems to be at least partly based on Arrow’s in-memory format. So, […] The post PyArrow vs Polars (vs DuckDB) for Data Pipelines. appeared first on Confessions of a Data Guy.

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Trending Sources

article thumbnail

5 Tools Every Data Scientist Needs in Their Toolbox in 2024

KDnuggets

From the soft tools to the hard tools, these are what make a data scientist successful.

Data 153
article thumbnail

A New Standard in Open Source AI: Meta Llama 3.1 on Databricks

databricks

We are excited to partner with Meta to release the Llama 3.1 series of models on Databricks, further advancing the standard of powerful.

145
145
article thumbnail

A Guide to Debugging Apache Airflow® DAGs

In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples to debugging Airflow DAGs. You’ll learn how to: Create a standardized process for debugging to quickly diagnose errors in your DAGs Identify common issues with DAGs, tasks, and connections Distinguish between Airflow-relate

article thumbnail

Snowflake Cortex Search: State-of-the-Art Hybrid Search for RAG Applications

Snowflake

Snowflake Cortex Search, a fully managed search service for documents and other unstructured data, is now in public preview. With Cortex Search, organizations can effortlessly deploy retrieval-augmented generation (RAG) applications with Snowflake, powering use cases like customer service, financial research and sales chatbots. Cortex Search offers state-of-the-art semantic and lexical search over your text data in Snowflake behind an intuitive user interface, and it comes with the robust securi

article thumbnail

Data News — Week 24.30

Christophe Blefari

Tallinn ( credits ) Dear members, it's Summer Data News, the only news you can consume by the pool, the beach or at the office—if you're not lucky. This week, I'm writing from the Baltics, nomading a bit in Eastern and Northern Europe. I'm pleased to announce that we have successfully closed the CfP for Forward Data Conf, we received nearly 100 submissions and the program committee is currently reviewing all submissions.

MySQL 130

More Trending

article thumbnail

Enhancing LLM-as-a-Judge with Grading Notes

databricks

Evaluating long-form LLM outputs quickly and accurately is critical for rapid AI development. As a result, many developers wish to deploy LLM-as-judge methods.

143
143
article thumbnail

Getting the Most From Your Modern Data Platform: A Three-Phase Approach

Snowflake

A robust, modern data platform is the starting point for your organization’s data and analytics vision. At first, you may use your modern data platform as a single source of truth to realize operational gains — but you can realize far greater benefits by adding additional use cases. In this blog, we offer guidance for leveraging Snowflake’s capabilities around data and AI to build apps and unlock innovation.

article thumbnail

Data Engineering Weekly #181

Data Engineering Weekly

Editor’s Note: A New Series on Data Engineering Tools Evaluation There are plenty of data tools and vendors in the industry. But how can we choose a tool for the specific need? The traditional evaluation of running PoC on all the selected vendor tools is time-consuming and practically unviable for growth-driven companies. Data Engineering Weekly is launching a new series on software evaluation focused on data engineering to better guide data engineering leaders in evaluating data tools.

article thumbnail

Learn Data Analysis with Julia

KDnuggets

Setup the environment, load the data, perform data analysis and visualization, and create the data pipeline all using Julia programming language.

article thumbnail

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

article thumbnail

Introducing Mosaic AI Model Training for Fine-Tuning GenAI Models

databricks

Today, we're thrilled to announce that Mosaic AI Model Training's support for fine-tuning GenAI models is now available in Public Preview. At Databricks.

article thumbnail

How Snowflake Accelerates Business Growth for Providers of Data, Apps and AI Products 

Snowflake

Let’s say you are building a house that you plan to put up for sale. You focus on an amazing design, beautiful entry, large windows for plenty of sunlight — things that will create a delightful experience for your future buyer. At the same time, the house also needs less glamorous but vitally important infrastructure, like plumbing, running water, electricity, heating, cooling and so on.

article thumbnail

Mainframe History: How Mainframe Computers Have Changed Over the Years

Precisely

Mainframes have one of the longest histories of any kind of computing technology that is still used today. In fact, mainframe history shows the fast-evolving landscape of technology, few innovations have left as profound a mark as mainframe computers. From their inception to the sophisticated systems of today, mainframes have continuously adapted to meet the ever-growing demands of business operations.

article thumbnail

How to Use Conditional Formatting in Pandas to Enhance Data Visualization

KDnuggets

Tired of staring at bland dataframes? Discover how conditional formatting in Pandas can transform your data visualization experience!

Data 140
article thumbnail

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

article thumbnail

Databricks on Databricks: Kicking off the Journey to Governance with Unity Catalog

databricks

In this blog, we are excited to share Databricks's journey in migrating to Unity Catalog for enhanced data governance. We'll discuss our high-level strategy and the tools we developed to facilitate the migration. Our goal is to highlight the benefits of Unity Catalog and make you feel confident about transitioning to it.

article thumbnail

Introducing Joint Investing Accounts at Robinhood

Robinhood

Today, we are excited to launch joint investing accounts, which allow customers to seamlessly manage investments with their partner while keeping their shared assets in one place. Joint accounts make investing more collaborative for families and loved ones, providing shared access for account holders that allows them to combine funds and increase their investment power as they work towards their financial goals.

Banking 98
article thumbnail

Odin: Uber’s Stateful Platform

Uber Engineering

Explore Odin, Uber’s stateful platform for managing all types of databases. It is a technology-agnostic, intent-based system that has dramatically improved the operational throughput of underlying hosts and databases company-wide.

article thumbnail

Visualizing Data: A Statology Primer

KDnuggets

This collection of tutorials from our sister site Statology center on data visualization. Learn more about visualizing your data right here.

Data 138
article thumbnail

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

article thumbnail

Primary Key and Foreign Key constraints are GA and now enable faster queries

databricks

Dataricks is thrilled to announce the General Availability (GA) of Primary Key (PK) and Foreign Key (FK) constraints, starting in Databricks Runtime 15.2.

article thumbnail

Celebrating Empowerment: Robinhood Market’s Women in Tech Conference 2024

Robinhood

Robinhood was founded on a simple idea: that our financial markets should be accessible to all. With customers at the heart of our decisions, Robinhood is lowering barriers and providing greater access to financial information and investing. Together, we are building products and services that help create a financial system everyone can participate in. … Recently, Robinhood Markets hosted its highly anticipated Annual Women in Tech (WIT) Conference, a day-long event designed to empower and inspi

Finance 82
article thumbnail

Zero Downtime Upgrades – Redefining Your Platform Upgrade Experience

Cloudera

Cloudera recently unveiled the latest version of Cloudera Private Cloud Base with the Zero Downtime Upgrade (ZDU) feature to enhance your user experience. The goal of ZDU is to make upgrades simpler for you and your stakeholders by increasing the availability of Cloudera’s services. How Do You Keep IT Infrastructure (and Buses) Running and Avoid Downtime?

article thumbnail

How to Use the pivot_table Function for Advanced Data Summarization in Pandas

KDnuggets

Let's learn to use Pandas pivot_table in Python to perform advance data summarization

Python 129
article thumbnail

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: Understand the building blocks DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to Write DAGs that adapt to your data at runtime and set up alerts and notifications Scale you

article thumbnail

A Framework for Multi-Model Forecasting on Databricks

databricks

Introduction Time series forecasting serves as the foundation for inventory and demand management in most enterprises. Using data from past periods along with.

article thumbnail

Radical Simplicity in Data Engineering

Towards Data Science

Learn from Software Engineers and Discover the Joy of ‘Worse is Better’ Thinking source: unsplash.com Recently, I have had the fortune of speaking to a number of data engineers and data architects about the problems they face with data in their businesses. The main pain points I heard time and time again were: Not knowing why something broke Getting burnt with high cloud compute costs Taking too long to build data solutions/complete data projects Needing expertise on many tools and technologies

article thumbnail

Accelerate your data streaming journey with the latest in Confluent Cloud

Confluent

CC 2024 Q2 adds Flink Private Networking (AWS), Flink SQL Interactive Tables; Enterprise:Connect w/Confluent, Connector Custom Offsets; SI: Build w/Confluent, etc.

Cloud 69
article thumbnail

How To Navigate the Filesystem with Python’s Pathlib

KDnuggets

Learn how to navigate and manage your filesystem with Python's built-in pathlib module.

Python 126
article thumbnail

Apache Airflow® Best Practices: DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

article thumbnail

Building Industry IoT and M2M Solutions With Databricks for Communications

databricks

The communications industry is experiencing immense change due to rapid technological advancements and evolving market trends. Communications service providers (CSP) build various solutions.

Building 116
article thumbnail

Handling Hierarchies in Dimensional Modeling

Towards Data Science

Hierarchies play a crucial role in dimensional modeling for Data Warehouses. See my recommendations on how to handle them effectively.

article thumbnail

Pickup in 3 minutes: Uber’s implementation of Live Activity on iOS

Uber Engineering

From WWDC reveal to delivery, discover how we tackled new tech, design challenges, and tight timelines to enhance rider & driver experiences with Live Activity® from Apple.

article thumbnail

How to Handle Time Zones and Timestamps Accurately with Pandas

KDnuggets

Learn how to handling the time-zone and timestamps in Pandas with Python.

Python 123
article thumbnail

How to Achieve High-Accuracy Results When Using LLMs

Speaker: Ben Epstein, Stealth Founder & CTO | Tony Karrer, Founder & CTO, Aggregage

When tasked with building a fundamentally new product line with deeper insights than previously achievable for a high-value client, Ben Epstein and his team faced a significant challenge: how to harness LLMs to produce consistent, high-accuracy outputs at scale. In this new session, Ben will share how he and his team engineered a system (based on proven software engineering approaches) that employs reproducible test variations (via temperature 0 and fixed seeds), and enables non-LLM evaluation m