Sat.Jul 20, 2024 - Fri.Jul 26, 2024

How to implement data quality checks with greatexpectations

Start Data Engineering

1. Introduction 2. Project overview 3. Check your data before making it available to end-users; Write-Audit-Publish(WAP) pattern 4. TL;DR: How the greatexpectations library works 4.1. greatexpectations quick setup 5. From an implementation perspective, there are four types of tests 5.1. Running checks on one dataset 5.2. Checks involving the current dataset and its historical data 5.3.

Datasets 208

PyArrow vs Polars (vs DuckDB) for Data Pipelines.

Confessions of a Data Guy

I’ve had something rattling around in the old noggin for a while; it’s just another strange idea that I can’t quite shake out. We all keep hearing about Arrow this and Arrow that … it seems every new tool built today for Data Engineering is at least partly based on Arrow’s in-memory format. So, […]

5 Tools Every Data Scientist Needs in Their Toolbox in 2024

KDnuggets

From the soft tools to the hard tools, these are what make a data scientist successful.

Data 154

A New Standard in Open Source AI: Meta Llama 3.1 on Databricks

databricks

We are excited to partner with Meta to release the Llama 3.1 series of models on Databricks, further advancing the standard of powerful.

145

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

Data News — Week 24.30

Christophe Blefari

Tallinn (credits). Dear members, it's Summer Data News, the only news you can consume by the pool, the beach or at the office, if you're not lucky. This week, I'm writing from the Baltics, nomading a bit in Eastern and Northern Europe. I'm pleased to announce that we have successfully closed the CfP for Forward Data Conf; we received nearly 100 submissions and the program committee is currently reviewing all of them.

MySQL 130

Data Engineering Weekly #181

Data Engineering Weekly

Editor’s Note: A New Series on Data Engineering Tools Evaluation There are plenty of data tools and vendors in the industry. But how can we choose a tool for the specific need? The traditional evaluation of running PoC on all the selected vendor tools is time-consuming and practically unviable for growth-driven companies. Data Engineering Weekly is launching a new series on software evaluation focused on data engineering to better guide data engineering leaders in evaluating data tools.

More Trending

Enhancing LLM-as-a-Judge with Grading Notes

databricks

Evaluating long-form LLM outputs quickly and accurately is critical for rapid AI development. As a result, many developers wish to deploy LLM-as-judge methods.

143

Snowflake Cortex Search: State-of-the-Art Hybrid Search for RAG Applications

Snowflake

Snowflake Cortex Search, a fully managed search service for documents and other unstructured data, is now in public preview. With Cortex Search, organizations can effortlessly deploy retrieval-augmented generation (RAG) applications with Snowflake, powering use cases like customer service, financial research and sales chatbots. Cortex Search offers state-of-the-art semantic and lexical search over your text data in Snowflake behind an intuitive user interface, and it comes with the robust securi

Maestro: Netflix’s Workflow Orchestrator

Netflix Tech

By Jun He, Natallia Dzenisenka, Praneeth Yenugutala, Yingyi Zhang, and Anjali Norwood. TL;DR: We are thrilled to announce that the Maestro source code is now open to the public! Please visit the Maestro GitHub repository to get started. If you find it useful, please give us a star. What is Maestro? Maestro is a general-purpose, horizontally scalable workflow orchestrator designed to manage large-scale workflows such as data pipelines and machine learning model training pipelines.

Learn Data Analysis with Julia

KDnuggets

Set up the environment, load the data, perform data analysis and visualization, and create the data pipeline, all using the Julia programming language.

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Databricks on Databricks: Kicking off the Journey to Governance with Unity Catalog

databricks

In this blog, we are excited to share Databricks's journey in migrating to Unity Catalog for enhanced data governance. We'll discuss our high-level strategy and the tools we developed to facilitate the migration. Our goal is to highlight the benefits of Unity Catalog and make you feel confident about transitioning to it.

Node.js and the tale of worker threads

Zalando Engineering

A disrupted gaming night I do not usually read code when dealing with production incidents, as it is one of the slower ways to understand and mitigate what is happening. But on that Friday night, I was glad I did. I was about to start another session of Elden Ring (a video game in which everything is pretty much trying to kill the player) when I was paged with the following: "campaign service is consuming all resources we throw at it".

Coding 102

Resilience in Action: How Cloudera’s Platform, and Data in Motion Solutions, Stayed Strong Amid the CrowdStrike Outage

Cloudera

Late last week, the tech world witnessed a significant disruption caused by a faulty update from CrowdStrike, a cybersecurity software company that focuses on protecting endpoints, cloud workloads, identity, and data. This update led to global IT outages, severely affecting various sectors such as banking, airlines, and healthcare. Many organizations found their systems rendered inoperative, highlighting the critical importance of system resilience and reliability.

Visualizing Data: A Statology Primer

KDnuggets

This collection of tutorials from our sister site Statology centers on data visualization. Learn more about visualizing your data right here.

Data 144

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Introducing Mosaic AI Model Training for Fine-Tuning GenAI Models

databricks

Today, we're thrilled to announce that Mosaic AI Model Training's support for fine-tuning GenAI models is now available in Public Preview. At Databricks.

Introducing Joint Investing Accounts at Robinhood

Robinhood

Today, we are excited to launch joint investing accounts, which allow customers to seamlessly manage investments with their partner while keeping their shared assets in one place. Joint accounts make investing more collaborative for families and loved ones, providing shared access for account holders that allows them to combine funds and increase their investment power as they work towards their financial goals.

Banking 89

How Snowflake Accelerates Business Growth for Providers of Data, Apps and AI Products 

Snowflake

Let’s say you are building a house that you plan to put up for sale. You focus on an amazing design, beautiful entry, large windows for plenty of sunlight — things that will create a delightful experience for your future buyer. At the same time, the house also needs less glamorous but vitally important infrastructure, like plumbing, running water, electricity, heating, cooling and so on.

How to Use Conditional Formatting in Pandas to Enhance Data Visualization

KDnuggets

Tired of staring at bland dataframes? Discover how conditional formatting in Pandas can transform your data visualization experience!

Data 142

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Primary Key and Foreign Key constraints are GA and now enable faster queries

databricks

Databricks is thrilled to announce the General Availability (GA) of Primary Key (PK) and Foreign Key (FK) constraints, starting in Databricks Runtime 15.2.

Zero Downtime Upgrades – Redefining Your Platform Upgrade Experience

Cloudera

Cloudera recently unveiled the latest version of Cloudera Private Cloud Base with the Zero Downtime Upgrade (ZDU) feature to enhance your user experience. The goal of ZDU is to make upgrades simpler for you and your stakeholders by increasing the availability of Cloudera’s services. How Do You Keep IT Infrastructure (and Buses) Running and Avoid Downtime?

Odin: Uber’s Stateful Platform

Uber Engineering

Explore Odin, Uber’s stateful platform for managing all types of databases. It is a technology-agnostic, intent-based system that has dramatically improved the operational throughput of underlying hosts and databases company-wide.

How to Use the pivot_table Function for Advanced Data Summarization in Pandas

KDnuggets

Let's learn to use Pandas pivot_table in Python to perform advanced data summarization.

Python 137

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

A Framework for Multi-Model Forecasting on Databricks

databricks

Introduction Time series forecasting serves as the foundation for inventory and demand management in most enterprises. Using data from past periods along with.

Accelerate your data streaming journey with the latest in Confluent Cloud

Confluent

CC 2024 Q2 adds Flink Private Networking (AWS), Flink SQL Interactive Tables; Enterprise:Connect w/Confluent, Connector Custom Offsets; SI: Build w/Confluent, etc.

Cloud 69

Modern Enterprise Data Modeling

Towards Data Science

How to address the shortcomings of shallow, outdated models and future-proof your modeling strategy.

How To Navigate the Filesystem with Python’s Pathlib

KDnuggets

Learn how to navigate and manage your filesystem with Python's built-in pathlib module.

Python 134

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

Building Industry IoT and M2M Solutions With Databricks for Communications

databricks

The communications industry is experiencing immense change due to rapid technological advancements and evolving market trends. Communications service providers (CSP) build various solutions.

Building 115

Pickup in 3 minutes: Uber’s implementation of Live Activity on iOS

Uber Engineering

From WWDC reveal to delivery, discover how we tackled new tech, design challenges, and tight timelines to enhance rider & driver experiences with Live Activity® from Apple.

Radical Simplicity in Data Engineering

Towards Data Science

Learn from Software Engineers and Discover the Joy of ‘Worse is Better’ Thinking. Recently, I have had the fortune of speaking to a number of data engineers and data architects about the problems they face with data in their businesses. The main pain points I heard time and time again were: not knowing why something broke; getting burnt with high cloud compute costs; taking too long to build data solutions/complete data projects; needing expertise on many tools and technologies

Using Transfer Learning to Boost Model Performance

KDnuggets

Transfer learning can improve model performance by leveraging pre-trained models and adapting them to new, related tasks.

The Cloud Development Environment Adoption Report

Cloud Development Environments (CDEs) are changing how software teams work by moving development to the cloud. Our Cloud Development Environment Adoption Report gathers insights from 223 developers and business leaders, uncovering key trends in CDE adoption. With 66% of large organizations already using CDEs, these platforms are quickly becoming essential to modern development practices.