September, 2022

How to Correctly Select a Sample From a Huge Dataset in Machine Learning

KDnuggets

We explain how choosing a small, representative dataset from a large population can improve model training reliability.

Datasets 160

Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi)

Simon Späti

Ever wanted or been asked to build an open-source Data Lake offloading data for analytics? Asked yourself what components and features that would include? Didn’t know the difference between a Data Lakehouse and a Data Warehouse? Or did you just want to govern your hundreds to thousands of files and have more database-like features, but didn’t know how?

Data Lake 130

Airflow Taskflow API: The Guide

Marc Lamberti

Airflow Taskflow is a new way of writing DAGs with ease. As you will see, you need to write fewer lines than before to obtain the same DAG, which makes DAGs easier to build, read, and maintain. The Taskflow API has three main aspects: XComArgs, decorators, and XCom backends. In this tutorial, you will learn what the Taskflow API is, why it is crucial for you, and how to create your DAGs.

SQL 130

Building Data Pipelines That Run From Source To Analysis And Activation With Hevo Data

Data Engineering Podcast

Summary: Any business that wants to understand its operations and customers through data requires some form of pipeline. Building reliable data pipelines is a complex and costly undertaking with many layered requirements. To reduce the time and effort required to build pipelines that power critical insights, Manish Jethani co-founded Hevo Data.

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

Real-Time Gaming Infrastructure for Millions of Users with Apache Kafka, ksqlDB, and WebSockets

Confluent

How gaming enterprises like Sony and Big Fish Games use Apache Kafka®, Confluent, and ksqlDB’s data streaming technologies for the best in-game experience, ROI, and real-time capabilities.

Kafka 124

Large Scale Industrialization Key to Open Source Innovation

Cloudera

We are now well into 2022 and the megatrends that drove the last decade in data — The Apache Software Foundation as a primary innovation vehicle for big data, the arrival of cloud computing, and the debut of cheap distributed storage — have now converged and offer clear patterns for competitive advantage for vendors and value for customers. Cloudera has been parlaying those patterns into clear wins for the community at large and, more importantly, streamlining the benefits of that innovation to

More Trending

The Rise of the Semantic Layer

Simon Späti

A semantic layer is something we use every day. We build dashboards with yearly and monthly aggregations. We design dimensions for drilling down reports by region, product, or whatever metrics we are interested in. What has changed is that we no longer use a singular business intelligence tool; different teams use different visualizations (BI, notebooks, and embedded analytics).

BI 130

Top 10 Globally Recognized Certifications for Cyber Security

U-Next

Introduction. Cybersecurity, also known as computer security or information security, is the practice of preventing theft, damage, loss, or unauthorized access to computers, networks, and data. As our interconnections grow, so do the chances for malicious hackers to steal, destroy, or disrupt our lives. The rise in cybercrime has increased the demand for cybersecurity expertise.

Build A Common Understanding Of Your Data Reliability Rules With Soda Core and Soda Checks Language

Data Engineering Podcast

Summary: Regardless of how data is being used, it is critical that the information is trusted. The practice of data reliability engineering has gained momentum recently to address that need. To support the efforts of data teams, the folks at Soda Data created the Soda Checks Language and the corresponding Soda Core utility that acts on this new DSL.

Building 100

Keeping Multiple Databases in Sync Using Kafka Connect and CDC

Confluent

Microservices have numerous benefits, but data silos are incredibly challenging. Learn how Kafka Connect and CDC provide real-time database synchronization, bridging data silos between all microservice applications.

Kafka 122

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Data Governance and Strategy for the Global Enterprise

Cloudera

In a recent blog, Cloudera Chief Technology Officer Ram Venkatesh described the evolution of a data lakehouse, as well as the benefits of using an open data lakehouse, especially the open Cloudera Data Platform (CDP). If you missed it, you can read up about it here. Modern data lakehouses are typically deployed in the cloud. Cloud computing brings several distinct advantages that are core to the lakehouse value proposition.

The Mistake Every Data Scientist Has Made at Least Once

KDnuggets

How to increase your chances of avoiding the mistake.

Data 160

Rejoice! The Vantage Analytics and Data Platform Provide Incredible Power for All in a “Cloudy” Environment

Teradata

With the release of VantageCloud Lake and ClearScape Analytics, Teradata brings a cloud-native architecture to extend the technical innovations and differentiators that Vantage is well known for.

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Power Your Real-Time Analytics Without The Headache Using Fivetran's Change Data Capture Integrations

Data Engineering Podcast

Summary: Data integration from source systems to their downstream destinations is the foundational step for any data product. The increasing expectation that information be instantly accessible drives the need for reliable change data capture. The team at Fivetran recently introduced that functionality to power real-time data products.

Food 100

Excited to be back at Google Cloud Next 2022!

Confluent

Highlighting sessions on the power of our Confluent-Google partnership: multi-layer data security, real-time cloud data streaming and analytics, database modernization, and more.

The Modern Data Lakehouse: An Architectural Innovation

Cloudera

The promise of a modern data lakehouse architecture. Imagine having self-service access to all business data, anywhere it may be, and being able to explore it all at once. Imagine quickly answering burning business questions nearly instantly, without waiting for data to be found, shared, and ingested. Imagine independently discovering rich new business insights from both structured and unstructured data working together, without having to beg for data sets to be made available.

More Performance Evaluation Metrics for Classification Problems You Should Know

KDnuggets

When building and optimizing your classification model, measuring how accurately it predicts your expected outcome is crucial. However, this metric alone is never the entire story, as it can still offer misleading results. That's where these additional performance evaluations come into play to help tease out more meaning from your model.

Building 160

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Timestone: Netflix’s High-Throughput, Low-Latency Priority Queueing System with Built-in Support…

Netflix Tech

By Kostas Christidis. Introduction: Timestone is a high-throughput, low-latency priority queueing system we built in-house to support the needs of Cosmos, our media encoding platform. Over the past 2.5 years, its usage has increased, and Timestone is now also the priority queueing engine backing Conductor, our general-purpose workflow orchestration engine, and BDP Sch

Systems 89

What Do You Want to be Famous for?

Teradata

Financial services organizations that exhibit true data literacy avoid bottlenecks and instead choose to build best in class solutions that meet current and future needs. Find out more.

Operational Analytics To Increase Efficiency For Multi-Location Businesses With OpsAnalitica

Data Engineering Podcast

Summary: To improve efficiency in any business, you must first know what is contributing to wasted effort or missed opportunities. When your business operates across multiple locations, it becomes even more challenging and important to gain insights into how work is being done. In this episode, Tommy Yionoulis shares his experiences working in the service and hospitality industries and how that led him to found OpsAnalitica, a platform for collecting and analyzing metrics on multi location

Event-Driven Microservices with Python and Apache Kafka

Confluent

A deep dive into how microservices work, why they’re the backbone of real-time applications, and how to build event-driven microservices applications with Python and Kafka.

Kafka 98

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Improve Underwriting Using Data and Analytics

Cloudera

Insurance carriers are always looking to improve operational efficiency. We’ve previously highlighted opportunities to improve digital claims processing with data and AI. In this post, I’ll explore opportunities to enhance risk assessment and underwriting, especially in personal lines and small and medium-sized enterprises. Underwriting is an area that can yield improvements by applying the old saying “work smarter, not harder.”

Insurance 107

5 Concepts You Should Know About Gradient Descent and Cost Function

KDnuggets

Why is Gradient Descent so important in Machine Learning? Learn more about this iterative optimization algorithm and how it is used to minimize a loss function.

KonMari your data: Planning a query migration using the Marie Kondo method

dbt Developer Hub

If you’ve ever heard of Marie Kondo, you’ll know she has an incredibly soothing and meditative method to tidying up physical spaces. Her KonMari Method is about categorizing, discarding unnecessary items, and building a sustainable system for keeping stuff. As an analytics engineer at your company, doesn’t that last sentence describe your job perfectly?!

7 Ways To Develop A Portfolio That Gets You Hired

U-Next

“Show, don’t tell” is what people tell writers and screenwriters, but it applies just as well to aspirants who want to land their dream jobs. Apart from your university degree and professional certifications, what adds compelling weight to your candidacy is a solid portfolio. A portfolio is like your business card and regardless of whether you are a fresher or someone experienced, moving up the corporate ladder, a portfolio is what will ensure you a job, that higher paycheck a

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

Building A Shared Understanding Of Data Assets In A Business Through A Single Pane Of Glass With Workstream

Data Engineering Podcast

Summary: There is a constant tension in business data between growing silos and breaking them down. Even when a tool is designed to integrate information as a guard against data isolation, it can easily become a silo of its own, where you have to make a point of using it to seek out information. To help distribute critical context about data assets and their status into the locations where work is being done, Nicholas Freund co-founded Workstream.

Building 100

6 Ways Data Streaming is Transforming Financial Services

Confluent

How banks and finance companies use Confluent to transform their digital systems with event-driven architecture, real-time payment processing, fraud detection, and analytics.

Banking 95

SCIM (System for Cross-domain Identity Management)

Cloudera

The identity team at Cloudera has been working to add System for Cross-domain Identity Management (SCIM) support to Cloudera Data Platform (CDP), and we’re happy to announce the general availability of SCIM on Azure Active Directory! Part One, CDP SCIM Support for Active Directory, covers the core elements of CDP’s SCIM support for Azure AD.

Systems 107

5 Data Science Skills That Pay & 5 That Don’t

KDnuggets

This article will go over the top 5 data science skills that pay you and 5 that don’t.

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus and as the complexity of the task increases: the number of use cases and corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.