Sat. Feb 05, 2022 - Fri. Feb 11, 2022

Managing Your Reusable Python Code as a Data Scientist

KDnuggets

Here are a few approaches that I have settled on for managing my own reusable Python code as a data scientist, presented from most to least general code use, and aimed at beginners.

Python 160

#ClouderaLife Spotlight: Marque Blackman, Director of Global Workplace

Cloudera

As we celebrate Black History Month, for this Employee Spotlight I sat down with Marque Blackman, co-lead of the Cloudera Black Employee Network (CBEN). We discussed his experience at Cloudera, his career transitions, and what he learned along the way. We also discussed his work with CBEN and his perspective on Black History Month. Meet Marque Blackman, Director of Global Workplace.

Trending Sources

Scalable Strategies For Protecting Data Privacy In Your Shared Data Sets

Data Engineering Podcast

Summary There are many dimensions to the work of protecting the privacy of users in our data. When you need to share a data set with other teams, departments, or businesses then it is of utmost importance that you eliminate or obfuscate personal information. In this episode Will Thompson explores the many ways that sensitive data can be leaked, re-identified, or otherwise be at risk, as well as the different strategies that can be employed to mitigate those attack vectors.

Data 100

New Data Horizons: Data Prep, Data Visualization, and Data Catalogs Are Ready for Prime Time

DataKitchen

The post New Data Horizons: Data Prep, Data Visualization, and Data Catalogs Are Ready for Prime Time first appeared on DataKitchen.

Data 98

A Guide to Debugging Apache Airflow® DAGs

In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples for debugging Airflow DAGs. You’ll learn how to: create a standardized process for debugging to quickly diagnose errors in your DAGs; identify common issues with DAGs, tasks, and connections; distinguish between Airflow-relate

How to Learn Math for Machine Learning

KDnuggets

So how much math do you need to know in order to work in the data science industry? The answer: Not as much as you think.

Announcing the GA of Cloudera DataFlow for the Public Cloud on Microsoft Azure

Cloudera

After the launch of Cloudera DataFlow for the Public Cloud (CDF-PC) on AWS a few months ago, we are thrilled to announce that CDF-PC is now generally available on Microsoft Azure, allowing NiFi users on Azure to run their data flows in a cloud-native runtime. With CDF-PC, NiFi users can import their existing data flows into a central catalog from where they can be deployed to a Kubernetes-based runtime through a simple flow deployment wizard or with a single CLI command.

Cloud 116

More Trending

Data pipeline asset management with Dataflow

Netflix Tech

By Sam Setegne, Jai Balani, Olek Gorajek. Glossary. Asset: any business logic code in a raw (e.g. SQL) or compiled (e.g. JAR) form to be executed as part of the user-defined data pipeline. Data pipeline: a set of tasks (or jobs) to be executed in a predefined order (a.k.a. DAG) for the purpose of transforming data using some business logic. Dataflow:

The Complete Collection of Data Science Cheat Sheets – Part 1

KDnuggets

A collection of cheat sheets that will help you prepare for a technical interview, assessment tests, class presentation, and help you revise core data science concepts.

Getting Started with Machine Learning

Cloudera

In recent years, Ethical AI has become an area of increased importance to organisations. Advances in the development and application of Machine Learning (ML) and Deep Learning (DL) algorithms require greater care to ensure that the ethics embedded in previous rule-based systems are not lost. This has led to Ethical AI being an increasingly popular search term and the subject of many industry analyst reports and papers.

How To Join Data in MongoDB

Rockset

MongoDB is one of the most popular databases for modern applications. It enables a more flexible approach to data modeling than traditional SQL databases. Developers can build applications more quickly because of this flexibility and also have multiple deployment options, from the cloud MongoDB Atlas offering through to the open-source Community Edition.

MongoDB 52

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

Palantir Developers: Learn to build in Palantir Foundry

Palantir

Introducing new resources for developers to elevate their impact in Foundry. Everyone in an organization should be able to use the right data to make the best decisions. That’s why Palantir is committed to making Foundry as intuitive and accessible as possible — not only for data scientists and engineers, but also for sales, product development, recruiting, and more.

Build a Web Scraper with Python in 5 Minutes

KDnuggets

In this article, I will show you how to create a web scraper from scratch in Python.

Python 159

Gartner® Recognizes Cloudera in Critical Capabilities for Cloud Database Management Systems for Operational Use Cases

Cloudera

Cloudera has been recognized as a Visionary in the 2021 Gartner® Magic Quadrant for Cloud Database Management Systems (DBMS), and for the first time Gartner evaluated CDP Operational Database (COD) against the 12 critical capabilities for Operational Databases. Overall, Gartner recognized 20 vendors for the Magic Quadrant, of which 16 were evaluated in the 2021 Gartner Critical Capabilities for Cloud Database Management Systems for Operational Use Cases and 18 vendors for the 2021 Gartner Critical Capabil

Building the Business Case for DataOps

DataKitchen

The post Building the Business Case for DataOps first appeared on DataKitchen.

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

Make a Snake Game with Scala in 10 Minutes

Rock the JVM

The ultimate 10-minute guide to building a Snake game in Scala: learn fast and code smarter

Scala 52

Junior Data Scientist: The Next Level

KDnuggets

Junior, Mid-Level, and Senior Data Scientists differ in their level of experience. This article goes through the expectations for each role and what is required to move up the ladder.

Data 132

ETL Testing Process

Grouparoo

Today, organizations are adopting modern ETL tools and approaches to gain as many insights as possible from their data. However, to ensure the accuracy and reliability of such insights, effective ETL testing needs to be performed. So what is an ETL tester’s responsibility? In this ETL testing tutorial, we’ll look at what ETL testing involves, the different types of ETL tests, and some challenges of ETL testing.

Process 52

Monte Carlo Data Observability Insights Now Available in the Snowflake Data Marketplace

Monte Carlo

Is your data quality improving? What is your most used data? Where in the pipeline are your most frequent data issues occurring? With Snowflake Secure Data Sharing, building custom workflows and dashboards to answer these questions has never been easier. I am excited to announce that Monte Carlo Data Observability Insights, end-to-end operational analytics of an organization’s data platform, is now available in the Snowflake Data Marketplace.

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

FS2: More Than Functional Streaming in Scala

Rock the JVM

Discover the ultimate tutorial on purely functional streams in Scala with FS2

Scala 52

The Not-so-Sexy SQL Concepts to Make You Stand Out

KDnuggets

Databases are the houses of our data and data scientists HAVE TO HAVE A KEY! In this article, I discuss some lesser-known SQL concepts that data scientists do not familiarize themselves with.

SQL 126

The JaffleGaggle Story: Data Modeling for a Customer 360 View

dbt Developer Hub

Editor's note: In this tutorial, Donny walks through the fictional story of a SaaS company called JaffleGaggle, which needs to group its freemium individual users into company accounts (aka a customer 360 view) in order to drive its product-led growth efforts. You can follow along with Donny's data modeling technique for identity resolution in this dbt project repo.

Time Series Forecasting: What, Why, and, How?

ProjectPro

This blog introduces the concept of time series forecasting models in the most detailed form. First, there will be a simple introduction to highlight the significance of such models. Next, you will find a section that presents the definition of a time series forecasting article. After that, you will explore popular time-series-forecasting models. The blog's last two parts cover various use cases of these models and projects related to time series analysis and forecasting problems.

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to; write DAGs that adapt to your data at runtime and set up alerts and notifications; scale you

Releasing Connexion to the Community

Zalando Engineering

Connexion is a Python framework that automagically handles HTTP requests based on the OpenAPI specification (formerly known as Swagger Spec) of your API, described in YAML format. Connexion allows you to write an OpenAPI specification, then maps the endpoints to your Python functions; this makes it unique, as many tools generate the specification based on your Python code.

Scala 40

5 Ways to Apply AI to Small Data Sets

KDnuggets

When applied correctly, AI algorithms on small data sets can give results free of human errors and false results. Here are some methods to apply AI to small data sets.

Algorithm 120

Data Engineering Annotated Monthly – January 2022

Big Data Tools

Due to the public holidays in Russia and my own vacation time, I didn’t get a chance to write an Annotated for December. Waiting a little longer might not be such a bad thing in this case, because now we have even more interesting releases to talk about! Hi, I’m Pasha Finkelshteyn, and I’ll be your guide through this month’s news. I’ll offer my impressions of recent developments in the data engineering sector and highlight new ideas from the wider community.

The motivation behind using graph convolutions

KDnuggets

This article is an excerpt from Machine Learning with PyTorch and Scikit-Learn, the new book from the widely acclaimed and bestselling Python Machine Learning series, fully updated and expanded to cover PyTorch, transformers, graph neural networks, and best practices.

Apache Airflow® Best Practices: DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Deploying a Streamlit WebApp to Heroku using DAGsHub

KDnuggets

Transform your machine learning models into a web app and share them with your friends and colleagues.

Data Mesh & Its Distributed Data Architecture

KDnuggets

Going forward, data professionals have found a new way to address the scalability of sources through data mesh.

Data Science Definition Humor: A Collection of Quirky Quotes Related to Data Science Definitions

KDnuggets

Read this collection of humorous, insightful quotes around data science that will hopefully brighten your day and make you laugh!

KDnuggets™ News 22:n06, Feb 9: Data Science Programming Languages and When To Use Them; Complete Collection of Data Science Cheat Sheets

KDnuggets

Data Science Programming Languages and When To Use Them; The Complete Collection of Data Science Cheat Sheets – Part 1; Build a Web Scraper with Python in 5 Minutes; 8 Best Data Science Courses to Enroll in 2022 For Steep Career Advancement; Classifying Long Text Documents Using BERT.

How to Achieve High-Accuracy Results When Using LLMs

Speaker: Ben Epstein, Stealth Founder & CTO | Tony Karrer, Founder & CTO, Aggregage

When tasked with building a fundamentally new product line with deeper insights than previously achievable for a high-value client, Ben Epstein and his team faced a significant challenge: how to harness LLMs to produce consistent, high-accuracy outputs at scale. In this new session, Ben will share how he and his team engineered a system (based on proven software engineering approaches) that employs reproducible test variations (via temperature 0 and fixed seeds), and enables non-LLM evaluation m