July, 2024

article thumbnail

What are the types of data quality checks?

Start Data Engineering

1. Introduction 2. Data Quality(DQ) checks are run as part of your pipeline 2.1. Ensure your consumers don’t get incorrect data with output DQ checks 2.2. Catch upstream issues quickly with input DQ checks 2.3. Waiting a long time to run output DQ checks? Save time & money with mid-pipeline DQ checks. 2.4. Track incoming and outgoing row counts with Audit logs 3.

Data 214
article thumbnail

PyArrow vs Polars (vs DuckDB) for Data Pipelines.

Confessions of a Data Guy

I’ve had something rattling around in the old noggin for a while; it’s just another strange idea that I can’t quite shake out. We all keep hearing about Arrow this and Arrow that … seems every new tool built today for Data Engineering seems to be at least partly based on Arrow’s in-memory format. So, […] The post PyArrow vs Polars (vs DuckDB) for Data Pipelines. appeared first on Confessions of a Data Guy.

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

article thumbnail

The software engineering industry in 2024: what changed, why, and what is next

The Pragmatic Engineer

The past 18 months have seen major change reshape the tech industry. What does it all mean for businesses and dev teams – and what will pragmatic software engineering approaches look like in the future? I tackled these burning questions in my conference talk, “What’s Old is New Again,” which was the keynote of the Craft Conference in May 2024.

article thumbnail

Data News — Week 24.30

Christophe Blefari

Tallinn ( credits ) Dear members, it's Summer Data News, the only news you can consume by the pool, the beach or at the office—if you're not lucky. This week, I'm writing from the Baltics, nomading a bit in Eastern and Northern Europe. I'm pleased to announce that we have successfully closed the CfP for Forward Data Conf, we received nearly 100 submissions and the program committee is currently reviewing all submissions.

MySQL 130
article thumbnail

Driving Responsible Innovation: How to Navigate AI Governance & Data Privacy

Speaker: Aindra Misra, Senior Manager, Product Management (Data, ML, and Cloud Infrastructure) at BILL

Join us for an insightful webinar that explores the critical intersection of data privacy and AI governance. In today’s rapidly evolving tech landscape, building robust governance frameworks is essential to fostering innovation while staying compliant with regulations. Our expert speaker, Aindra Misra, will guide you through best practices for ensuring data protection while leveraging AI capabilities.

article thumbnail

DAIS 2024: Testing framework from the Dataflow model for Apache Spark Structured Streaming

Waitingforcode

With this blog I'm starting a follow-up series for my Data+AI Summit 2024 talk. I missed this family of blog posts a lot as the previous DAIS with me as speaker was 4 years ago! As previously, this time too I'll be writing several blog posts that should help you remember the talk and also cover some of the topics left aside because of the time constraints.

Data 130
article thumbnail

Top 8 GenAI Courses for AWS to Take Now

KDnuggets

This article is for anyone looking to maximize their use of Amazon Web Services (AWS) generative AI (GenAI) services. Here are eight courses that range from beginner to expert level.

AWS 142

More Trending

article thumbnail

Databricks on Databricks: Kicking off the Journey to Governance with Unity Catalog

databricks

In this blog, we are excited to share Databricks's journey in migrating to Unity Catalog for enhanced data governance. We'll discuss our high-level strategy and the tools we developed to facilitate the migration. Our goal is to highlight the benefits of Unity Catalog and make you feel confident about transitioning to it.

article thumbnail

Introducing Apache Kafka® 3.8

Confluent

Apache Kafka 3.8 adds 17 new KIPs (13 for Core, 3 for Streams & 1 for Connect). Highlights include 2 new Docker images, the ability to set task assignors, and more!

Kafka 130
article thumbnail

Data News — Week 24.28

Christophe Blefari

EuroSeagull ( credits ) Dear members, it's been a few weeks since I did not catch you on a proper Data News with a collection of links. Here we are. This week, I attended EuroPython in Prague. While I spent most of my time at the dltHub booth in the sponsors hall, I didn't attend many talks. However, I did give a few presentations on my SQL orchestration library, yato , which pairs well with dlt.

Kafka 130
article thumbnail

Data+AI Summit 2024 - Retrospective - Streaming

Waitingforcode

Welcome to the first Data+AI Summit 2024 retrospective blog post. I'm opening the series with the topic close to my heart at the moment, stream processing!

Data 130
article thumbnail

Launching LLM-Based Products: From Concept to Cash in 90 Days

Speaker: Christophe Louvion, Chief Product & Technology Officer of NRC Health and Tony Karrer, CTO at Aggregage

Christophe Louvion, Chief Product & Technology Officer of NRC Health, is here to take us through how he guided his company's recent experience of getting from concept to launch and sales of products within 90 days. In this exclusive webinar, Christophe will cover key aspects of his journey, including: LLM Development & Quick Wins 🤖 Understand how LLMs differ from traditional software, identifying opportunities for rapid development and deployment.

article thumbnail

5 Tips for Improving SQL Query Performance

KDnuggets

If you work in data, you’ll write SQL queries all the time. So how do you write efficient SQL queries that are optimized for performance? This tutorial will help you with just that.

SQL 136
article thumbnail

Data Engineering Weekly #181

Data Engineering Weekly

Editor’s Note: A New Series on Data Engineering Tools Evaluation There are plenty of data tools and vendors in the industry. But how can we choose a tool for the specific need? The traditional evaluation of running PoC on all the selected vendor tools is time-consuming and practically unviable for growth-driven companies. Data Engineering Weekly is launching a new series on software evaluation focused on data engineering to better guide data engineering leaders in evaluating data tools.

article thumbnail

Ingest data from SQL Server, Salesforce, and Workday with LakeFlow Connect

databricks

We’re excited to announce the Public Preview of LakeFlow Connect for SQL Server, Salesforce, and Workday. These ingestion connectors enable simple and efficient.

SQL 132
article thumbnail

16 Ways Insurance Companies Can Use Data and AI

Snowflake

How insurance leaders can use the power of data and AI to transform the industry, from claims analytics to risk selection and beyond There is a growing recognition that insurers can introduce data, analytics and AI into virtually all of the important insurance functions and workflows, including product development, pricing and risk selection, underwriting, claims management, contact center optimization, distribution management, reinsurance, and understanding and shaping customer journeys.

Insurance 113
article thumbnail

How To Speak The Language Of Financial Success In Product Management

Speaker: Jamie Bernard

Success in product management goes beyond delivering great features - it’s about achieving measurable financial outcomes that resonate across the organization. By connecting your product’s journey with the company’s financial success, you’ll ensure that every feature, release, and innovation contributes to the bottom line, driving both customer satisfaction and business growth.

article thumbnail

AI Lab: The secrets to keeping machine learning engineers moving fast

Engineering at Meta

The key to developer velocity across AI lies in minimizing time to first batch (TTFB) for machine learning (ML) engineers. AI Lab is a pre-production framework used internally at Meta. It allows us to continuously A/B test common ML workflows – enabling proactive improvements and automatically preventing regressions on TTFB. AI Lab prevents TTFB regressions whilst enabling experimentation to develop improvements.

article thumbnail

Introducing Cloudera Observability Premium

Cloudera

There’s nothing worse than wasting money on unnecessary costs. In on-premises data estates, these costs appear as wasted person-hours waiting for inefficient analytics to complete, or troubleshooting jobs that have failed to execute as expected, or at all. They manifest as idle hardware waiting for urgent workloads to come in, ensuring sufficient spare capacity to run them amidst noisy neighbors and resource-hungry, lower-priority workloads.

Metadata 102
article thumbnail

Tools Every Data Scientist Should Know: A Practical Guide

KDnuggets

Discover the essential tools every data scientist should know to elevate their data science game, from Python and R to SQL and advanced visualization tools.

article thumbnail

Data Engineering Weekly #182

Data Engineering Weekly

Meta: Introducing Llama 3.1: Our most capable models to date Probability one of the hottest announcements this week is Llama 3.1 release - the first-ever open-sourced frontier AI model competitive with leading foundation models across a range of tasks, including GPT-4, GPT-4o, and Claude 3.5 Sonnet. The Llama3 herd of models is an insightful paper that helps one deeply understand the foundational model.

article thumbnail

What Is Entity Resolution? How It Works & Why It Matters

Entity Resolution Sometimes referred to as data matching or fuzzy matching, entity resolution, is critical for data quality, analytics, graph visualization and AI. Learn what entity resolution is, why it matters, how it works and its benefits. Advanced entity resolution using AI is crucial because it efficiently and easily solves many of today’s data quality and analytics problems.

article thumbnail

Welcoming Prodvana to Databricks: Investing in Next-Gen Infrastructure

databricks

The Prodvana team joins Databricks to support new innovations in the Data Intelligence Platform infrastructure. Learn more about the vision and what's ahead.

Data 134
article thumbnail

Snowflake’s Summer of Sports and AI

Snowflake

All eyes are on sports this summer, with blockbuster events happening in everything from soccer and cycling to cricket and car racing. Snowflake is excited to join the action with a virtual “relay race,” where Snowflake sports and data experts, customers and partners will demonstrate how the sports industry can win big with data and AI. Industry leaders already know that sports runs on data analytics: from individual athlete performance and team statistics, to marketing and fan engagement, to ti

article thumbnail

Story points are pointless by Dave Ogle

Scott Logic

More and more I am of the opinion that putting points against stories is a waste of time. I’ve spent many hours, as I’m sure have you, sitting in meetings of various shapes and sizes guessing numbers and looking back I’m starting to question if it was really worth it. I’ll say upfront, I’m going to be fairly critical of story pointing here, I’m not just being a grumpy old Yorkshireman!

IT 97
article thumbnail

Ripple's Data Evolution: Leveraging Databricks for Next-Gen XRP Ledger Analytics

Ripple Engineering

Introduction: Embracing the Future with Ripple's Data Platform Migration Welcome to a pivotal moment in Ripple's data journey. As leaders at the intersection of blockchain technology and financial services, we're excited to share a transformative step in our data management evolution. We recently embarked on a significant data platform migration, transitioning from Hadoop to Databricks, a move motivated by our relentless pursuit of excellence and our contributions to the XRP Ledge

Hadoop 96
article thumbnail

Provide Real Value in Your Applications with Data and Analytics

The complexity of financial data, the need for real-time insight, and the demand for user-friendly visualizations can seem daunting when it comes to analytics - but there is an easier way. With Logi Symphony, we aim to turn these challenges into opportunities. Our platform empowers you to seamlessly integrate advanced data analytics, generative AI, data visualization, and pixel-perfect reporting into your applications, transforming raw data into actionable insights.

article thumbnail

Landing a Data Engineer Role: Free Courses and Certifications

KDnuggets

Is it possible to learn data engineering for free? I claim it is and present the evidence for that in the form of 10 free data engineering courses.

article thumbnail

Enhanced 3D Layers in ArcGIS

ArcGIS

Esri is working with partners (Maxar, TomTom) to enhance our 3D basemaps with high-quality commercial data for elevation and buildings layers.

Building 106
article thumbnail

Enhancing LLM-as-a-Judge with Grading Notes

databricks

Evaluating long-form LLM outputs quickly and accurately is critical for rapid AI development. As a result, many developers wish to deploy LLM-as-judge methods.

128
128
article thumbnail

Snowflake Expands Leading AI Data Cloud into Global Regulated and Sovereign Markets

Snowflake

Regulated and sovereign markets across the world have stringent requirements stipulating certain important data be kept within geographical borders or even for certain workloads to have dedicated environments, separate from those of other customers. In these markets, organizations need a secure and well-governed data foundation with effective controls to help comply with regulatory requirements.

Cloud 111
article thumbnail

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Speaker: Timothy Chan, PhD., Head of Data Science

Are you ready to move beyond the basics and take a deep dive into the cutting-edge techniques that are reshaping the landscape of experimentation? 🌐 From Sequential Testing to Multi-Armed Bandits, Switchback Experiments to Stratified Sampling, Timothy Chan, Data Science Lead, is here to unravel the mysteries of these powerful methodologies that are revolutionizing how we approach testing.

article thumbnail

Data Engineering Weekly #180

Data Engineering Weekly

Canva: How Canva collects 25 billion events per day Canva writes about its event collection infrastructure capabilities, handling 25 billion events per day (800 billion events per month) with 99.999% uptime. At our team’s inception, a key decision we made, one we still believe to be a big part of our success, was that every collected event must have a machine-readable, well-documented schema.

article thumbnail

Odin: Uber’s Stateful Platform

Uber Engineering

Explore Odin, Uber’s stateful platform for managing all types of databases. It is a technology-agnostic, intent-based system that has dramatically improved the operational throughput of underlying hosts and databases company-wide.

article thumbnail

The Role of AI in Digital Marketing

KDnuggets

Artificial intelligence (AI) has revolutionized numerous sectors, including digital marketing. This field leverages online platforms to promote products and services.

137
137
article thumbnail

Explore BlueBikes ride data with ArcGIS Pro Charts

ArcGIS

In this blog article, we'll explore BlueBikes data, a bike share service in bustling Boston, and uncover hidden insights through the power of visualization

Data 89
article thumbnail

The AI Superhero Approach to Product Management

Speaker: Conrado Morlan

In this engaging and witty talk, industry expert Conrado Morlan will explore how artificial intelligence can transform the daily tasks of product managers into streamlined, efficient processes. Using the lens of a superhero narrative, he’ll uncover how AI can be the ultimate sidekick, aiding in data management and reporting, enhancing productivity, and boosting innovation.