Sat. Jul 30, 2022 - Fri. Aug 05, 2022

How to Deal with Categorical Data for Machine Learning

KDnuggets

Check out this guide to implementing different types of encoding for categorical data, including a cheat sheet on when to use what type.
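As a rough illustration of what encoding categorical data means, here is a minimal sketch of two common schemes, ordinal and one-hot encoding. The function names `ordinal_encode` and `one_hot_encode` are hypothetical, and the guide itself may cover different or additional encoders.

```python
def ordinal_encode(values, order):
    """Map each category to its position in an explicitly given order."""
    index = {category: i for i, category in enumerate(order)}
    return [index[v] for v in values]


def one_hot_encode(values):
    """Return the sorted category list and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Ordinal encoding suits categories with a natural order.
sizes = ordinal_encode(["low", "high", "medium"], order=["low", "medium", "high"])

# One-hot encoding suits categories with no inherent order.
color_names, color_rows = one_hot_encode(["red", "blue", "red"])
```

Ordinal encoding keeps a single column but implies an ordering, while one-hot encoding avoids that implication at the cost of one column per category.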

Speeding up Queries With Z-Order

Cloudera

Z-order is an ordering for multi-dimensional data, e.g. rows in a database table. Once data is in Z-order, it is possible to search efficiently against more columns. This article reveals how Z-ordering works and how one can use it with Apache Impala. In a previous blog post, we demonstrated the power of Parquet page indexes, which can greatly improve the performance of selective queries.
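To make the idea concrete, the sketch below computes a Z-order (Morton) key by interleaving the bits of two integer columns and sorts rows by that key. This is a generic illustration under simplifying assumptions (two non-negative integer columns, a hypothetical helper named `interleave_bits`); it is not how Impala implements Z-ordering internally.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return a Morton (Z-order) key for two non-negative integers."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return key


# Sort a small 4x4 grid of (x, y) rows by their Z-order key.
rows = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))
```

Because neither column dominates the sort key, rows that are close in both columns tend to land near each other, which is what lets min/max statistics prune data for filters on either column.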

Data Mesh — A Data Movement and Processing Platform @ Netflix

Netflix Tech

By Bo Lei, Guilherme Pires, James Shao, Kasturi Chatterjee, Sujay Jain, Vlad Sydorenko. Background: Real-time processing technologies (a.k.a. stream processing) are one of the key factors that enable Netflix to maintain its leading position in the competition of entertaining our users. Our previous-generation streaming pipeline solution, Keystone, has a proven track record of serving multiple of our key business needs.

Process 109

What "Data Lineage Done Right" Looks Like And How They're Doing It At Manta

Data Engineering Podcast

Summary Data lineage is the roadmap for your data platform, providing visibility into all of the dependencies for any report, machine learning model, or data warehouse table that you are working with. Because of its centrality to your data systems it is valuable for debugging, governance, understanding context, and myriad other purposes. This means that it is important to have an accurate and complete lineage graph so that you don’t have to perform your own detective work when time is in s

IT 100

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

Most In-demand Artificial Intelligence Skills To Learn In 2022

KDnuggets

Artificial Intelligence (AI) is the process of programming a computer so that it can reason and learn like a human being and make decisions for itself.

An "Everything Data" Approach to Smart Cities

Teradata

Teradata’s approach to the Smart City is an analytics-centric, city-data-ecosystem approach designed to give access across all relevant data. Find out more.

Data 98

More Trending

Interactive Exploratory Data Analysis On Petabyte Scale Data Sets With Arkouda

Data Engineering Podcast

Summary Exploratory data analysis works best when the feedback loop is fast and iterative. This is easy to achieve when you are working on small datasets, but as they scale up beyond what can fit on a single machine those short iterations quickly become long and tedious. The Arkouda project is a Python interface built on top of the Chapel compiler to bring back those interactive speeds for exploratory analysis on horizontally scalable compute that parallelizes operations on large volumes of data

Getting Started with SQL Cheatsheet

KDnuggets

Want to get started with SQL? Check out the latest cheatsheet from KDnuggets to get up to speed on the basics of one of the most popular, useful, and in-demand languages in the world of data science.

SQL 150

Getting Started with Database Modernization

Confluent

Move to any cloud, modernize any database, and integrate data in real-time with Confluent, reducing the costs of syncing on-prem and cloud deployments.

Fine-Tune Fair to Capacity Scheduler in Weight Mode

Cloudera

Introduction. Cloudera Data Platform (CDP) unifies the technologies from Cloudera Enterprise Data Hub (CDH) and Hortonworks Data Platform (HDP). As part of that unification process, Cloudera merged the YARN Scheduler functionality from the legacy platforms, creating a Capacity Scheduler that better services all customers. In merging this scheduler functionality, Cloudera significantly reduced the time and effort to migrate from CDH and HDP.

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Case Study: How Rockset Turbocharges Real-Time Personalization at Whatnot

Rockset

Whatnot is a venture-backed e-commerce startup built for the streaming age. We’ve built a live video marketplace for collectors, fashion enthusiasts, and superfans that allows sellers to go live and sell anything they’d like through our video auction platform. Think eBay meets Twitch. Coveted collectibles were the first items on our livestream when we launched in 2020.

Kafka 52

Trust in AI is Priceless

KDnuggets

Many machine learning models fail to deliver. Sadly, it’s often due to a lack of focus on data quality.

Confluent announces launch of Cloud Reseller Program

Confluent

The reseller program allows consulting partners to receive wholesale Confluent Cloud pricing, own their customer relationships, and help them maximize the value of their data.

Pay after placement Data Science

U-Next

As a career option, Data Science is India’s latest youth buzz, and the reasons are a dynamic work sector, great compensation, and a prestigious job reputation. After-placement payment. Introduction to Data Science. Data are considered new-age gold mines. Companies from all sectors recognise the value of utilising data to analyse performances and predict outcomes to facilitate judgement calls.

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

The Modern-Day AI Executive: Most AI Investments Return Zero

Elder Research

10 Most Used Tableau Functions

KDnuggets

Learn about the most used string, number, date, logical, and aggregation Tableau functions.

Apache Kafka at Home: A Houseplant Alerting System with ksqlDB

Confluent

Learn how we built a practical data pipeline use case, powering real-time alerts for when to water houseplants using Apache Kafka and ksqlDB.

Kafka 64

Spark Data Lineage

Yelp Engineering

In this blog post, we introduce Spark-Lineage, an in-house product to track and visualize how data at Yelp is processed, stored, and transferred among our services. What is Spark-Lineage? Spark and Spark-ETL: At Yelp, Spark is considered a first-class citizen, handling batch jobs in all corners, from crunching reviews to identify similar restaurants in the same area, to performing reporting analytics about optimizing local business search.

Data 52

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Cyber Security Analyst Salary

U-Next

It’s always a great idea to check salary beforehand when considering joining a new field. Here you can read everything about monthly Cyber Security Analyst salaries and the highest paying Cyber Security jobs. Introduction to Cyber Security Analyst Salary. The salary of a Cyber Security Analyst depends on lots of different factors. Salary varies as per experience, the number of jobs available in the market corresponding to the supply of professionals, and the level of qualification a person

A community developing a Hugging Face for customer data modeling

KDnuggets

A year ago, Objectiv started a community of 50 companies to develop a Hugging Face-like open-source project for customer data modeling. The key objective: enable building data models on one team's or company’s dataset, and then running them seamlessly on another.

Datasets 139

3 Questions With Sapna Nair — Eventbrite’s New VP of Engineering in India

Eventbrite Engineering

Sapna Nair joins Eventbrite as our new Managing Director and Vice President of Engineering in India. Sapna is a dynamic leader who will lead Eventbrite’s expansion into India and add to our engineering expertise. Her experience building distributed teams will accelerate hiring of top-tier talent in India, helping to deliver on our ambitious technical vision …

How We’re Implementing a Data Mesh at Sanne Group

Monte Carlo

Initial thoughts on our data team’s data mesh implementation plan and moving toward the four data mesh principles of domain data ownership, data as a product, self-service, and federated governance. The buzz around the data mesh is interesting in that many data professionals have opinions about it, some are even moving towards it, but very few are bold enough to claim they have done it.

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Enforcing rules at scale with pre-commit-dbt

dbt Developer Hub

At dbt Labs, we have best practices we like to follow for the development of dbt projects. One of them, for example, is that all models should have at least unique and not_null tests on their primary key. But how can we enforce rules like this? That question becomes difficult to answer in large dbt projects. Developers might not follow the same conventions.

Python 52

Decision Trees vs Random Forests, Explained

KDnuggets

A simple, non-math heavy explanation of two popular tree-based machine learning models.

Android in Analytics Infra

Yelp Engineering

At Yelp, we have a reasonably large Android community for a company of Yelp’s size. These talented and skilled Android engineers work on Yelp’s client and business applications. We would like to share some of the unique challenges that we’ve experienced along with our various efforts to overcome those challenges. Analytics Infra is a team at Yelp that works on experimentation and logging platforms and supports them across the entire Yelp ecosystem.

Monte Carlo and Databricks Partner to Help Companies Build More Reliable Data Lakehouses

Monte Carlo

As companies increasingly leverage data-driven insights to innovate and maintain their competitive edge, it’s essential that this data is accurate and reliable. With Monte Carlo and Databricks’ partnership, teams can trust their data through end-to-end data observability across their lakehouse environments. Has your CTO ever told you that the numbers in a report you showed her looked way off?

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

How to Become Cyber Security Expert

U-Next

The demand for cyber security experts and engineers is prevalent worldwide. You just need the right guidance to study and land a job as a cyber security professional. Read on to learn more about cyber security. Introduction. Every network and gadget has the potential to be dangerous. Cybersecurity hazards are one of these dangers. Explore how to be a cybersecurity expert and contribute to the safety of the digital world.

Full Stack Everything? Organizational Intersections Between Data Science, Dev & Tech

KDnuggets

Breakthrough value is found when teams collaborate at their intersections to come up with innovative solutions.

How Many Nodes Are in a Snowflake Virtual Warehouse?

Propel Data

Snowflake uses credits, which are analogous to CPU nodes, in order to pay for the virtual warehouses that power its analytical query engine.

Free MLOps Crash Course for Beginners

KDnuggets

Interest in, and demand for, MLOps is growing exponentially. What, exactly, is it? Why is it important? Where should you turn next to learn more? Check out this crash course to find the answers to these questions and more.

IT 131

The Cloud Development Environment Adoption Report

Cloud Development Environments (CDEs) are changing how software teams work by moving development to the cloud. Our Cloud Development Environment Adoption Report gathers insights from 223 developers and business leaders, uncovering key trends in CDE adoption. With 66% of large organizations already using CDEs, these platforms are quickly becoming essential to modern development practices.