Sat.Oct 15, 2022 - Fri.Oct 21, 2022

Frameworks for Approaching the Machine Learning Process

KDnuggets

This post is a summary of 2 distinct frameworks for approaching machine learning tasks, followed by a distilled third. Do they differ considerably (or at all) from each other, or from other such processes available?

Pollen’s enormous debt left behind: exclusive details

The Pragmatic Engineer

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of five topics in today’s subscriber-only The Scoop issue. To get this newsletter every week, subscribe. Pollen, the events festival tech startup, went bankrupt in August after raising more than $200M in venture funding. In an exclusive investigative article, I covered the events and details leading up to this bankruptcy.

Rust for Data Engineering

Simon Späti

Will Rust kill Python for Data Engineers? If you only came here to know this, my answer is no. Betteridge’s Law strikes again! But then again, you have to ask: was Python made for Data Engineering in the first place? Rust may not replace Python outright, but it has consumed more and more of JavaScript tooling and there are increasingly many projects trying to do the same with Python/Data Engineering.

Independent Anniversary

Jesse Anderson

I have a calendar reminder that tells me when I founded Big Data Institute. It just told me I founded the company eight years ago. The reminder is called “Independent Anniversary.” It’s the day I split off and executed my vision for an independent, big data consulting company. Independence has all sorts of manifestations. For you, it’s an independent look at technology and vendors from someone who’s worked at a vendor (Cloudera) and worked in distributed systems for even longer.

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

Working With Sparse Features In Machine Learning Models

KDnuggets

Sparse features can cause problems like overfitting and suboptimal results in learning models, and understanding why this happens is crucial when developing models. Multiple methods, including dimensionality reduction, are available to overcome issues due to sparse features.

An Exploration Of The Open Data Lakehouse And Dremio's Contribution To The Ecosystem

Data Engineering Podcast

Summary: The "data lakehouse" architecture balances the scalability and flexibility of data lakes with the ease of use and transaction support of data warehouses. Dremio is one of the companies leading the development of products and services that support the open lakehouse. In this episode Jason Hughes explains what it means for a lakehouse to be "open" and describes the different components that the Dremio team builds and contributes to.

More Trending

#ClouderaLife Spotlight: Elias Avila, Sr. Staff Proactive Support Engineer

Cloudera

As we wrap up Hispanic Heritage Month, this #ClouderaLife Spotlight features Elias Avila, senior staff proactive support engineer for Cloudera. In this spotlight, we talk about his career in technology and his philosophy for getting the most out of work in terms of satisfaction and advancement. We also talk about his upbringing in the primarily Mexican American community of Salinas, California, and the important role Hispanics play in California’s Central Valley.

7 Free Platforms for Building a Strong Data Science Portfolio

KDnuggets

Outshine others and increase your odds of getting hired by maintaining a data science portfolio with projects, resumes, blogs, and reports.

Speeding Up The Time To Insight For Supply Chains And Logistics With The Pathway Database That Thinks

Data Engineering Podcast

Summary: Logistics and supply chains have been under increased stress and scrutiny in recent years. In order to stay ahead of customer demands, businesses need to be able to react quickly and intelligently to changes, which requires fast and accurate insights into their operations. Pathway is a streaming database engine that embeds artificial intelligence into the storage, with functionality designed to support the spatiotemporal data that is crucial for shipping and logistics.

The Good and the Bad of Apache Kafka Streaming Platform

AltexSoft

We say ‘xerox’ speaking of any photocopy, whether or not it was created by a machine from the Xerox corporation. We describe information search on the Internet with just one word — ‘google’. We ‘photoshop pictures’ instead of editing them on the computer. And COVID-19 made ‘zoom’ a synonym for a videoconference. Kafka can continue the list of brand names that became generic terms for the entire type of technology.

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Cloudera Uses CDP to Reduce IT Cloud Spend by $12 Million

Cloudera

Like all of our customers, Cloudera depends on the Cloudera Data Platform (CDP) to manage our day-to-day analytics and operational insights. Many aspects of our business live within this modern data architecture, providing all Clouderans the ability to ask, and answer, important questions for the business. Clouderans continuously push for improvements in the system, with the goal of driving up confidence in the data.

25 Advanced SQL Interview Questions for Data Scientists

KDnuggets

Check out this collection of advanced SQL interview questions with answers.

Orchestrating Data/ML Workflows at Scale With Netflix Maestro

Netflix Tech

By Jun He, Akash Dwivedi, Natallia Dzenisenka, Snehal Chennuru, Praneeth Yenugutala, Pawan Dixit. At Netflix, Data and Machine Learning (ML) pipelines are widely used and have become central for the business, representing diverse use cases that go beyond recommendations, predictions and data transformations. A large number of batch workflows run daily to serve various business needs.

5 Steps To A Successful Data Warehouse Migration

Monte Carlo

Platform and data warehouse migrations aren’t something you do every day or even every few years, but they’re becoming much more frequent as organizations seek to modernize their data infrastructure with the new capabilities being offered by Snowflake, Databricks, Google, AWS, and others. [Editor’s note: We agree. Cloud database migrations were listed in our latest ebook The 22 Hottest Trends In Data Right Now.] Migrations are like Schrödinger’s cat.

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Public or On-Prem? Telco giants are optimizing the network with the Hybrid Cloud

Cloudera

The telecommunications industry continues to develop hybrid data architectures to support data workload virtualization and cloud migration. However, while the promise of the cloud remains essential — not just for data workloads but also for network virtualisation and B2B offerings — the sheer volume and scale of data in the industry require careful management of the “journey to the cloud.”

A Data Science Portfolio That Will Land You The Job in 2022

KDnuggets

Check out this article on crafting a data science portfolio that will get you that job. And learn 4 resume mistakes to avoid at any cost.

Public SQL Endpoints in Rockset

Rockset

Introduction: Making use of real-time data for analytics is a deeply collaborative project. We’ve helped data engineers, data architects, engineering leaders, ML teams, and product managers connect the dots between various systems to deliver on Rockset’s promise of fast queries on fresh data. Not only are we collaborating with customers on analytics projects, we use our own product daily and collaborate across teams internally.

Data and Analytics Keep the Wheels on the Bus!

Teradata

The complexity of modern vehicles makes it difficult to spot the root causes that prevent them from working. Mechanics, operators & OEMs must step into a new era of digital data-based diagnostics.

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Cybersecurity: A Big Data Problem

Cloudera

Information technology has been at the heart of governments around the world, enabling them to deliver vital citizen services, such as healthcare, transportation, employment, and national security. All of these functions rest on technology and share a valuable commodity: data. Data is produced and consumed in ever-increasing amounts and therefore must be protected.

Essential Books You Need to Become a Data Engineer

KDnuggets

In this article, I will go through the roadmap of books you need to become a Data Engineer.

Building Real-Time Recommendations with Kafka, S3, Rockset and Retool

Rockset

Real-time customer 360 applications are essential in allowing departments within a company to have reliable and consistent data on how a customer has engaged with its products and services. Ideally, when someone from a department has engaged with a customer, you want up-to-date information so the customer doesn’t get frustrated and repeat the same information multiple times to different people.

Hypothesis Testing: A Step-by-Step Guide With Easy Examples

U-Next

Introduction. When we hear the word ‘hypothesis,’ the first thing that comes to our mind is a kind of theory. Assuming and explaining theories is a fundamental part of Business Analytics. In the past few years, the field of Business Analytics has proliferated and made several advancements. As the number of people interested in its statistical applications in business has increased, the concept of hypothesis testing has grabbed everyone’s attention.

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Using Kafka Connect Securely in the Cloudera Data Platform

Cloudera

In this post I will demonstrate how Kafka Connect is integrated in the Cloudera Data Platform (CDP), allowing users to manage and monitor their connectors in Streams Messaging Manager while also touching on security features such as role-based access control and sensitive information handling. If you are a developer moving data in or out of Kafka, an administrator, or a security expert this post is for you.

10 Essential SQL Commands for Data Science

KDnuggets

Learn SQL commands for filtering, string operations, alias, joining tables, if-else statements, and grouping.

React SEO: How To Optimize React Websites for SEO

Trio

React enables much of the modern web you’re familiar with: fluid, responsive, and animation-rich websites. It’s no wonder that React.js is the most used JavaScript framework for web development, according to the 2021 State of JavaScript survey.

What Is Data Collection? Methods, Types, Tools, and Techniques

U-Next

Introduction. The primary goal of data collection is to gather high-quality information that aims to provide responses to all of the open-ended questions. Businesses and management can obtain high-quality information by collecting data that is necessary for making educated decisions. It is necessary to gather data to draw conclusions and decide what is factual to increase the quality of the information.

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

Apache Hop 2.1.0 is available

know.bi

The Apache Hop team just released version 2.1.0. This new release is the result of four and a half months of work on over 200 tickets and comes packed with new functionality, bug fixes and improvements.

Attend the Data Science Symposium 2022

KDnuggets

The Center for Business Analytics at the University of Cincinnati will present its annual Data Science Symposium 2022 on November 8. This all-day, in-person event will have three featured speakers and two tech talk tracks with four concurrent presentations in each track. The event, held at the Lindner College of Business, is open to all.

PostgreSQL vs. MySQL: 10 Key Differences 

Meltano

PostgreSQL and MySQL are among the most popular open-source relational database management systems (RDBMSs) worldwide. Both RDBMSs enable businesses to organize and interlink large amounts of data, allowing for effective data management. For all of their similarities, PostgreSQL and MySQL differ from one another in many ways. In this PostgreSQL vs. MySQL comparison, we analyze crucial differences between the two database management systems to discover how they work and when to use them.

Designing Events and Event Streams: Introduction and Best Practices

Confluent

The Cloud Development Environment Adoption Report

Cloud Development Environments (CDEs) are changing how software teams work by moving development to the cloud. Our Cloud Development Environment Adoption Report gathers insights from 223 developers and business leaders, uncovering key trends in CDE adoption. With 66% of large organizations already using CDEs, these platforms are quickly becoming essential to modern development practices.