July, 2021

Containerizing Apache Hadoop Infrastructure at Uber

Uber Engineering

Introduction. As Uber’s business grew, we scaled our Apache Hadoop (referred to as ‘Hadoop’ in this article) deployment to 21000+ hosts in 5 years, to support the various analytical and machine learning use cases. We built a team with varied … The post Containerizing Apache Hadoop Infrastructure at Uber appeared first on Uber Engineering Blog.

Hadoop 145

Tyrannical Data and Its Antidotes in the Microservices World

Confluent

Data is the lifeblood of so much of what we build as software professionals, so it’s unsurprising that operations involving its transfer occupy the vast majority of developer time across […].

IT 141

Delivering Modern Enterprise Data Engineering with Cloudera Data Engineering on Azure

Cloudera

After the launch of CDP Data Engineering (CDE) on AWS a few months ago, we are thrilled to announce that CDE, the only cloud-native service purpose-built for enterprise data engineers, is now available on Microsoft Azure. CDP Data Engineering offers an all-inclusive toolset that enables data pipeline orchestration, automation, advanced monitoring, visual profiling, and a comprehensive management toolset for streamlining ETL processes and making complex data actionable across your analytic team

How to Validate Datatypes in Python

Start Data Engineering

Data type issues are one of the biggest concerns when processing data in Python. If you are wondering how to make sure that a column is of a specific data type (e.g.

Python 130
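
As a rough illustration of the "native Python" approach that the teaser above mentions, the following sketch checks each field of a row against an expected type using only the standard library. The schema, the column names, and the `find_type_errors` helper are hypothetical and are not taken from the linked article.

```python
from datetime import date

# Hypothetical schema: column name -> expected Python type.
EXPECTED_TYPES = {"user_id": int, "signup_date": date, "plan": str}


def find_type_errors(row, expected=EXPECTED_TYPES):
    """Return (column, expected type name, actual type name) for each mismatch."""
    errors = []
    for column, expected_type in expected.items():
        value = row.get(column)  # a missing column shows up as NoneType
        # bool is a subclass of int, so it would pass isinstance(value, int);
        # treat it as a mismatch for int columns.
        is_bool_in_int_column = expected_type is int and isinstance(value, bool)
        if is_bool_in_int_column or not isinstance(value, expected_type):
            errors.append((column, expected_type.__name__, type(value).__name__))
    return errors


good_row = {"user_id": 1, "signup_date": date(2021, 7, 1), "plan": "pro"}
bad_row = {"user_id": "1", "signup_date": date(2021, 7, 1), "plan": "pro"}
print(find_type_errors(good_row))
print(find_type_errors(bad_row))
```

A library such as Pydantic, which the article also covers, would replace the manual loop with declared models. The loop here is only meant to show the underlying check.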

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

Airflow on Kubernetes: Get started in 10 mins

Marc Lamberti

Airflow on Kubernetes is quite popular, isn’t it? There is a good chance that you know Kubernetes, that you even have a Kubernetes cluster, and that you would like to deploy and run Airflow on it. However, Kubernetes is hard. There are so many things to deal with that it can be really laborious just to deploy an application. Luckily for us, some super smart people have created Helm.

Data Movement in Netflix Studio via Data Mesh

Netflix Tech

By Andrew Nguonly, Armando Magalhães, Obi-Ike Nwoke, Shervin Afshar, Sreyashi Das, Tongliang Liu, Wei Liu, Yucheng Zeng. Background: Over the next few years, most content on Netflix will come from Netflix’s own Studio. From the moment a Netflix film or series is pitched and long before it becomes available on Netflix, it goes through many phases.

Data 103

More Trending

Create a Data Analysis Pipeline with Apache Kafka and RStudio

Confluent

In Data Science projects, we distinguish between descriptive analytics and statistical models running in production. Overall, these can be seen as one process. You start with analyzing historical data to […].

Reflecting on Cloudera’s Commitment to Address Workplace Inequality: One Year Later

Cloudera

It’s been a year of awakening and change across the U.S. and around the world. One year ago our CEO Rob Bearden vowed to take decisive action to make Cloudera a more diverse, equitable, and inclusive place to work and have Cloudera take an active role in promoting those attributes in the tech industry and our communities. There is no one-size-fits-all solution to creating an intentional and strategic plan for a diverse workforce.

Finance 126

Adding Context And Comprehension To Your Analytics Through Data Discovery With SelectStar

Data Engineering Podcast

Summary: Companies of all sizes and industries are trying to use the data that they and their customers generate to survive and thrive in the modern economy. As a result, they are relying on a constantly growing number of data sources being accessed by an increasingly varied set of users. In order to help data consumers find and understand the data that is available, and help the data producers understand how to prioritize their work, SelectStar has built a data discovery platform that brings everyone

BI 100

COVID-19 Pandemic Analytics for a Safe Return-To-Office

Teradata

Learn how Teradata is using Advanced Analytics to guide its safe Return-To-Office (RTO) policy for its global employees. Read more.

IT 98

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

DevOps Is Not DataOps

DataKitchen

Arvind Murali, Intelligent Data podcast host, interviews DataKitchen CEO Chris Bergh about how DataOps helps improve the speed of data & analytics deployment. The post DevOps Is Not DataOps first appeared on DataKitchen.

Data 98

Elastic Distributed Training with XGBoost on Ray

Uber Engineering

Introduction. Since we productionized distributed XGBoost on Apache Spark™ at Uber in 2017, XGBoost has powered a wide spectrum of machine learning (ML) use cases at Uber, spanning from optimizing marketplace dynamic pricing policies for Freight, improving times of … The post Elastic Distributed Training with XGBoost on Ray appeared first on Uber Engineering Blog.

Protecting Data Integrity in Confluent Cloud: Over 8 Trillion Messages Per Day

Confluent

It’s about maintaining the right data even when no one is watching. Last year, Confluent announced support for Infinite Storage, which fundamentally changes data retention in Apache Kafka® by allowing […].

#ClouderaLife Spotlight: Vinicius Cardoso, Sr Solutions Engineering

Cloudera

Meet Vinicius Cardoso, better known as Vini. He is a Sr. Solutions Engineer (SE) working in Australia. In his role, customers are at the center of everything he does. Wearing the hat of Enterprise Architect, he dives deep to understand customers’ organizational goals, initiatives, and requirements in order to identify the key capabilities that need to be delivered.

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Building a Multi-Tenant Managed Platform For Streaming Data With Pulsar at Datastax

Data Engineering Podcast

Summary: Everyone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have still been high due to the level of technical understanding and operational capacity that have been required to run at scale.

Building 100

Data Engineers of Netflix – Interview with Kevin Wylie

Netflix Tech

This post is part of our “Data Engineers of Netflix” series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix. Kevin Wylie is a Data Engineer on the Content Data Science and Engineering team. In this post, Kevin talks about his extensive experience in content analytics at Netflix since joining more than 10 years ago.

Does Your Organization Need a Chief Data Officer? Probably

DataKitchen

The post Does Your Organization Need a Chief Data Officer? Probably first appeared on DataKitchen.

Data 90

Customer Support Automation Platform at Uber

Uber Engineering

High Level Overview of the Problem. Introduction. If you’ve used any online/digital service, chances are that you are familiar with what a typical customer service experience entails: you send a message (usually email aliased) to the company’s support staff, fill … The post Customer Support Automation Platform at Uber appeared first on Uber Engineering Blog.

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Announcing ksqlDB 0.19.0

Confluent

We’re pleased to announce ksqlDB 0.19.0! This release includes a new NULLIF function and a major upgrade to ksqlDB’s data modeling capabilities—foreign-key joins. We’re excited to share this highly requested […].

Data 135

Accelerate Offloading to Cloudera Data Warehouse (CDW) with Procedural SQL Support

Cloudera

Did you know Cloudera customers, such as SMG and Geisinger, offloaded their legacy DW environment to Cloudera Data Warehouse (CDW) to take advantage of CDW’s modern architecture and best-in-class performance? In addition to substantial cost savings upon moving to CDW, Geisinger is also able to search through hundreds of millions of patient note records in seconds, providing better treatment to their patients.

Bringing The Metrics Layer To The Masses With Transform

Data Engineering Podcast

Summary: Collecting and cleaning data is only useful if someone can make sense of it afterward. The latest evolution in the data ecosystem is the introduction of a dedicated metrics layer to help address the challenge of adding context and semantics to raw information. In this episode Nick Handel shares the story behind Transform, a new platform that provides a managed metrics layer for your data platform.

SQL 100

Recommender Systems: Behind the Scenes of Machine-Learning-Based Personalization

AltexSoft

Steve Jobs once said, “People don’t know what they want until you show it to them”. Well, try arguing that considering that we all watch videos suggested by YouTube, buy goods suggested by Amazon, and watch TV shows suggested by Netflix. People like being guided and given relevant offers and recommendations. They like being treated in a personal manner.

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

What Is Referential Transparency and Why Should You Care?

Rock the JVM

Discover how referential transparency boosts your productivity as a functional programmer in Scala and why it's crucial

Scala 52
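
The idea named in the title above can be sketched briefly. The linked article discusses it in Scala; Python is used here only for illustration, and both functions are hypothetical. An expression is referentially transparent when it can be replaced by its value without changing what the program does.

```python
def total_price(net, tax_rate):
    """Pure: the result depends only on the arguments, with no side effects."""
    return net * (1 + tax_rate)


_invoice_state = {"last_number": 0}


def next_invoice_number():
    """Not referentially transparent: each call mutates and reads shared state."""
    _invoice_state["last_number"] += 1
    return _invoice_state["last_number"]


# Substituting the call with its value is safe for the pure function:
price = total_price(100, 0.25)
same_either_way = total_price(100, 0.25) + total_price(100, 0.25) == price + price

# The same substitution changes behavior for the impure one:
first = next_invoice_number()
second = next_invoice_number()
substitution_is_unsafe = first != second
```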

Data cleaning for nulls with SQL vs. code

Grouparoo

When preparing your data set for analysis, it is crucial to ensure that your data set is both complete and accurate. One step in this process is deciding how to handle null values. Depending on how your data is going to be used, you may not want null values at all! Let's clean some data. We're going to take a look at calculating Lifetime Value (LTV) of a customer.

SQL 52
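
To make the "SQL vs. code" contrast in the teaser above concrete, here is a minimal sketch that computes a per-customer LTV twice: once by replacing nulls in SQL with `COALESCE`, and once by replacing them in application code. The `purchases` table, its columns, and the sample rows are assumptions made for this example and do not come from the linked article. The sketch uses the standard-library `sqlite3` module.

```python
import sqlite3

# Hypothetical purchase rows: (customer_id, amount); None stands for SQL NULL.
ROWS = [(1, 20.0), (1, None), (2, None), (2, 5.5), (3, None)]


def ltv_with_sql(rows):
    """Treat NULL amounts as 0 inside the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT customer_id, SUM(COALESCE(amount, 0)) "
        "FROM purchases GROUP BY customer_id ORDER BY customer_id"
    ).fetchall()
    conn.close()
    return dict(result)


def ltv_with_code(rows):
    """Treat missing amounts as 0 in Python instead."""
    totals = {}
    for customer_id, amount in rows:
        cleaned = amount if amount is not None else 0.0
        totals[customer_id] = totals.get(customer_id, 0.0) + cleaned
    return totals


print(ltv_with_sql(ROWS))
print(ltv_with_code(ROWS))
```

Without `COALESCE`, `SUM` skips NULL values and returns NULL for a customer whose amounts are all NULL, which is one reason the choice of where to clean matters.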

Speed, Scale, Storage: Our Journey from Apache Kafka to Performance in Confluent Cloud

Confluent

At Confluent, we focus on the holy trinity of performance, price, and availability, with the goal of delivering a similar performance envelope for all workloads across all supported cloud providers. […].

Cloud 122

Enterprise Data Science Workflows with AMPs and Streamlit

Cloudera

Here in the virtual Fast Forward Lab at Cloudera, we do a lot of experimentation to support our applied machine learning research and Cloudera Machine Learning product development. We believe the best way to learn what a technology is capable of is to build things with it. Only through hands-on experimentation can we discern truly useful new algorithmic capabilities from hype.

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

Strategies For Proactive Data Quality Management

Data Engineering Podcast

Summary: Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue.

Knowledge Graph Technologies Accelerate and Improve the Data Model Definition for Master Data

Zalando Engineering

The Master Data Management Challenge: Master data management (MDM) is a technology-enabled discipline in which business and Information Technology work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise's official shared master data assets. 1 At Zalando we are at an early phase of realising MDM for our internal data assets and we have chosen to do it in a consolidated style.

DataKitchen Wins Data & Analytics Vendor of the Year Award – OnConferences

DataKitchen

The post DataKitchen Wins Data & Analytics Vendor of the Year Award – OnConferences first appeared on DataKitchen.

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.