Sat.Jul 20, 2024 - Fri.Jul 26, 2024

article thumbnail

How to implement data quality checks with greatexpectations

Start Data Engineering

1. Introduction 2. Project overview 3. Check your data before making it available to end-users; Write-Audit-Publish(WAP) pattern 4. TL;DR: How the greatexpectations library works 4.1. greatexpectations quick setup 5. From an implementation perspective, there are four types of tests 5.1. Running checks on one dataset 5.2. Checks involving the current dataset and its historical data 5.3.

Datasets 209
article thumbnail

PyArrow vs Polars (vs DuckDB) for Data Pipelines.

Confessions of a Data Guy

I’ve had something rattling around in the old noggin for a while; it’s just another strange idea that I can’t quite shake out. We all keep hearing about Arrow this and Arrow that … seems every new tool built today for Data Engineering seems to be at least partly based on Arrow’s in-memory format. So, […] The post PyArrow vs Polars (vs DuckDB) for Data Pipelines. appeared first on Confessions of a Data Guy.

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

article thumbnail

Data News — Week 24.30

Christophe Blefari

Tallinn ( credits ) Dear members, it's Summer Data News, the only news you can consume by the pool, the beach or at the office—if you're not lucky. This week, I'm writing from the Baltics, nomading a bit in Eastern and Northern Europe. I'm pleased to announce that we have successfully closed the CfP for Forward Data Conf, we received nearly 100 submissions and the program committee is currently reviewing all submissions.

MySQL 130
article thumbnail

Databricks on Databricks: Kicking off the Journey to Governance with Unity Catalog

databricks

In this blog, we are excited to share Databricks's journey in migrating to Unity Catalog for enhanced data governance. We'll discuss our high-level strategy and the tools we developed to facilitate the migration. Our goal is to highlight the benefits of Unity Catalog and make you feel confident about transitioning to it.

article thumbnail

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

article thumbnail

Data Engineering Weekly #181

Data Engineering Weekly

Editor’s Note: A New Series on Data Engineering Tools Evaluation There are plenty of data tools and vendors in the industry. But how can we choose a tool for the specific need? The traditional evaluation of running PoC on all the selected vendor tools is time-consuming and practically unviable for growth-driven companies. Data Engineering Weekly is launching a new series on software evaluation focused on data engineering to better guide data engineering leaders in evaluating data tools.

article thumbnail

Snowflake Cortex Search: State-of-the-Art Hybrid Search for RAG Applications

Snowflake

Snowflake Cortex Search, a fully managed search service for documents and other unstructured data, is now in public preview. With Cortex Search, organizations can effortlessly deploy retrieval-augmented generation (RAG) applications with Snowflake, powering use cases like customer service, financial research and sales chatbots. Cortex Search offers state-of-the-art semantic and lexical search over your text data in Snowflake behind an intuitive user interface, and it comes with the robust securi

More Trending

article thumbnail

A New Standard in Open Source AI: Meta Llama 3.1 on Databricks

databricks

We are excited to partner with Meta to release the Llama 3.1 series of models on Databricks, further advancing the standard of powerful.

138
138
article thumbnail

Node.js and the tale of worker threads

Zalando Engineering

A disrupted gaming night I do not usually read code when dealing with production incidents, as it is one of the slower ways to understand and mitigate what is happening. But on that Friday night, I was glad I did. I was about to start another session of Elden Ring (a video game in which everything is pretty much trying to kill the player) when I was paged with the following: "campaign service is consuming all resources we throw at it".

Coding 102
article thumbnail

Maestro: Netflix’s Workflow Orchestrator

Netflix Tech

By Jun He , Natallia Dzenisenka , Praneeth Yenugutala , Yingyi Zhang , and Anjali Norwood TL;DR We are thrilled to announce that the Maestro source code is now open to the public! Please visit the Maestro GitHub repository to get started. If you find it useful, please give us a star. What is Maestro Maestro is a general-purpose, horizontally scalable workflow orchestrator designed to manage large-scale workflows such as data pipelines and machine learning model training pipelines.

article thumbnail

Introducing Joint Investing Accounts at Robinhood

Robinhood

Today, we are excited to launch joint investing accounts, which allow customers to seamlessly manage investments with their partner while keeping their shared assets in one place. Joint accounts make investing more collaborative for families and loved ones, providing shared access for account holders that allows them to combine funds and increase their investment power as they work towards their financial goals.

Banking 91
article thumbnail

Prepare Now: 2025s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, and Terrence Sheflin

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

article thumbnail

Enhancing LLM-as-a-Judge with Grading Notes

databricks

Evaluating long-form LLM outputs quickly and accurately is critical for rapid AI development. As a result, many developers wish to deploy LLM-as-judge methods.

127
127
article thumbnail

Learn Data Analysis with Julia

KDnuggets

Setup the environment, load the data, perform data analysis and visualization, and create the data pipeline all using Julia programming language.

article thumbnail

Resilience in Action: How Cloudera’s Platform, and Data in Motion Solutions, Stayed Strong Amid the CrowdStrike Outage

Cloudera

Late last week, the tech world witnessed a significant disruption caused by a faulty update from CrowdStrike, a cybersecurity software company that focuses on protecting endpoints, cloud workloads, identity, and data. This update led to global IT outages, severely affecting various sectors such as banking, airlines, and healthcare. Many organizations found their systems rendered inoperative, highlighting the critical importance of system resilience and reliability.

article thumbnail

How Snowflake Accelerates Business Growth for Providers of Data, Apps and AI Products 

Snowflake

Let’s say you are building a house that you plan to put up for sale. You focus on an amazing design, beautiful entry, large windows for plenty of sunlight — things that will create a delightful experience for your future buyer. At the same time, the house also needs less glamorous but vitally important infrastructure, like plumbing, running water, electricity, heating, cooling and so on.

article thumbnail

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

article thumbnail

Introducing Mosaic AI Model Training for Fine-Tuning GenAI Models

databricks

Today, we're thrilled to announce that Mosaic AI Model Training's support for fine-tuning GenAI models is now available in Public Preview. At Databricks.

article thumbnail

5 Tools Every Data Scientist Needs in Their Toolbox in 2024

KDnuggets

From the soft tools to the hard tools, these are what make a data scientist successful.

Data 133
article thumbnail

Zero Downtime Upgrades – Redefining Your Platform Upgrade Experience

Cloudera

Cloudera recently unveiled the latest version of Cloudera Private Cloud Base with the Zero Downtime Upgrade (ZDU) feature to enhance your user experience. The goal of ZDU is to make upgrades simpler for you and your stakeholders by increasing the availability of Cloudera’s services. How Do You Keep IT Infrastructure (and Buses) Running and Avoid Downtime?

article thumbnail

Odin: Uber’s Stateful Platform

Uber Engineering

Explore Odin, Uber’s stateful platform for managing all types of databases. It is a technology-agnostic, intent-based system that has dramatically improved the operational throughput of underlying hosts and databases company-wide.

article thumbnail

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.

article thumbnail

A Framework for Multi-Model Forecasting on Databricks

databricks

Introduction Time series forecasting serves as the foundation for inventory and demand management in most enterprises. Using data from past periods along with.

article thumbnail

Visualizing Data: A Statology Primer

KDnuggets

This collection of tutorials from our sister site Statology center on data visualization. Learn more about visualizing your data right here.

Data 109
article thumbnail

Accelerate your data streaming journey with the latest in Confluent Cloud

Confluent

CC 2024 Q2 adds Flink Private Networking (AWS), Flink SQL Interactive Tables; Enterprise:Connect w/Confluent, Connector Custom Offsets; SI: Build w/Confluent, etc.

Cloud 69
article thumbnail

Radical Simplicity in Data Engineering

Towards Data Science

Learn from Software Engineers and Discover the Joy of ‘Worse is Better’ Thinking source: unsplash.com Recently, I have had the fortune of speaking to a number of data engineers and data architects about the problems they face with data in their businesses. The main pain points I heard time and time again were: Not knowing why something broke Getting burnt with high cloud compute costs Taking too long to build data solutions/complete data projects Needing expertise on many tools and technologies

article thumbnail

The Ultimate Guide To Data-Driven Construction: Optimize Projects, Reduce Risks, & Boost Innovation

Speaker: Donna Laquidara-Carr, PhD, LEED AP, Industry Insights Research Director at Dodge Construction Network

In today’s construction market, owners, construction managers, and contractors must navigate increasing challenges, from cost management to project delays. Fortunately, digital tools now offer valuable insights to help mitigate these risks. However, the sheer volume of tools and the complexity of leveraging their data effectively can be daunting. That’s where data-driven construction comes in.

article thumbnail

Primary Key and Foreign Key constraints are GA and now enable faster queries

databricks

Dataricks is thrilled to announce the General Availability (GA) of Primary Key (PK) and Foreign Key (FK) constraints, starting in Databricks Runtime 15.2.

article thumbnail

How to Use Conditional Formatting in Pandas to Enhance Data Visualization

KDnuggets

Tired of staring at bland dataframes? Discover how conditional formatting in Pandas can transform your data visualization experience!

Data 104
article thumbnail

Pickup in 3 minutes: Uber’s implementation of Live Activity on iOS

Uber Engineering

From WWDC reveal to delivery, discover how we tackled new tech, design challenges, and tight timelines to enhance rider & driver experiences with Live Activity® from Apple.

article thumbnail

Mainframe History: How Mainframe Computers Have Changed Over the Years

Precisely

Mainframes have one of the longest histories of any kind of computing technology that is still used today. In fact, mainframe history shows the fast-evolving landscape of technology, few innovations have left as profound a mark as mainframe computers. From their inception to the sophisticated systems of today, mainframes have continuously adapted to meet the ever-growing demands of business operations.

article thumbnail

Business Intelligence 101: How To Make The Best Solution Decision For Your Organization

Speaker: Evelyn Chou

Choosing the right business intelligence (BI) platform can feel like navigating a maze of features, promises, and technical jargon. With so many options available, how can you ensure you’re making the right decision for your organization’s unique needs? 🤔 This webinar brings together expert insights to break down the complexities of BI solution vetting.

article thumbnail

Building Industry IoT and M2M Solutions With Databricks for Communications

databricks

The communications industry is experiencing immense change due to rapid technological advancements and evolving market trends. Communications service providers (CSP) build various solutions.

Building 101
article thumbnail

How to Use the pivot_table Function for Advanced Data Summarization in Pandas

KDnuggets

Let's learn to use Pandas pivot_table in Python to perform advance data summarization

Python 106
article thumbnail

Beyond Automation: Unveiling the True Essence of BDD by Xin Chen

Scott Logic

Many organisations claim they are applying Behaviour-Driven Development (BDD). When you discuss BDD with them, they often present numerous feature files as evidence of their adoption of the BDD methodology. These days, some still believe that writing a test case in the Given-When-Then format constitutes BDD, or that BDD is synonymous with a test automation framework.

article thumbnail

Cortex Search: State-of-the-art hybrid search for RAG and AI App

Snowflake

Snowflake Cortex Search, a fully managed search service for documents and other unstructured data, is now in public preview. With Cortex Search, organizations can effortlessly deploy retrieval-augmented generation (RAG) applications with Snowflake, powering use cases like customer service, financial research and sales chatbots. Cortex Search offers state-of-the-art semantic and lexical search over your text data in Snowflake behind an intuitive user interface, and it comes with the robust securi

article thumbnail

Driving Responsible Innovation: How to Navigate AI Governance & Data Privacy

Speaker: Aindra Misra, Senior Manager, Product Management (Data, ML, and Cloud Infrastructure) at BILL

Join us for an insightful webinar that explores the critical intersection of data privacy and AI governance. In today’s rapidly evolving tech landscape, building robust governance frameworks is essential to fostering innovation while staying compliant with regulations. Our expert speaker, Aindra Misra, will guide you through best practices for ensuring data protection while leveraging AI capabilities.