July, 2021

How to Validate Datatypes in Python

Start Data Engineering

Data type issues are one of the biggest concerns when processing data in Python. If you are wondering how to make sure that a column is of a specific data type (e.g.
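As a rough illustration of the kind of native-Python check the teaser alludes to, here is a minimal sketch using only the standard library. It is not the article's own code: the function name `validate_column`, the row-as-dict layout, and the sample data are all assumptions made for this example.

```python
from typing import Iterable, List


def validate_column(rows: Iterable[dict], column: str, expected_type: type) -> List[int]:
    """Return the indexes of rows whose `column` value is not of `expected_type`.

    Hypothetical helper: rows are assumed to be plain dicts keyed by column name.
    """
    bad_rows = []
    for index, row in enumerate(rows):
        value = row.get(column)  # a missing key yields None, which fails the check
        # bool is a subclass of int in Python, so reject it unless bool was requested
        if isinstance(value, bool) and expected_type is not bool:
            bad_rows.append(index)
        elif not isinstance(value, expected_type):
            bad_rows.append(index)
    return bad_rows


# Illustrative data only
rows = [{"id": 1}, {"id": "2"}, {"id": None}, {"id": True}]
print(validate_column(rows, "id", int))  # [1, 2, 3]
```

Returning the offending row indexes, rather than raising on the first failure, makes it possible to report every bad value in a single pass.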

Python 130

Airflow on Kubernetes : Get started in 10 mins

Marc Lamberti

Airflow on Kubernetes is quite popular, isn’t it? There is a good chance that you know Kubernetes, that you even have a Kubernetes cluster, and that you would like to deploy and run Airflow on it. However, Kubernetes is hard. There are so many things to deal with that it can be really laborious just to deploy an application. Fortunately for us, some super smart people have created Helm.

Containerizing Apache Hadoop Infrastructure at Uber

Uber Engineering

Introduction. As Uber’s business grew, we scaled our Apache Hadoop (referred to as ‘Hadoop’ in this article) deployment to 21000+ hosts in 5 years, to support the various analytical and machine learning use cases. We built a team with varied … The post Containerizing Apache Hadoop Infrastructure at Uber appeared first on Uber Engineering Blog.

Hadoop 145

Tyrannical Data and Its Antidotes in the Microservices World

Confluent

Data is the lifeblood of so much of what we build as software professionals, so it’s unsurprising that operations involving its transfer occupy the vast majority of developer time across […].

IT 141

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Delivering Modern Enterprise Data Engineering with Cloudera Data Engineering on Azure

Cloudera

After the launch of CDP Data Engineering (CDE) on AWS a few months ago, we are thrilled to announce that CDE, the only cloud-native service purpose-built for enterprise data engineers, is now available on Microsoft Azure. CDP Data Engineering offers an all-inclusive toolset that enables data pipeline orchestration, automation, advanced monitoring, visual profiling, and a comprehensive management toolset for streamlining ETL processes and making complex data actionable across your analytic team

Adding Context And Comprehension To Your Analytics Through Data Discovery With SelectStar

Data Engineering Podcast

Summary Companies of all sizes and industries are trying to use the data that they and their customers generate to survive and thrive in the modern economy. As a result, they are relying on a constantly growing number of data sources being accessed by an increasingly varied set of users. In order to help data consumers find and understand the data that is available, and help the data producers understand how to prioritize their work, SelectStar has built a data discovery platform that brings everyone

BI 100

More Trending

DevOps Is Not DataOps

DataKitchen

Arvind Murali, Intelligent Data podcast host, interviews DataKitchen CEO Chris Bergh about how DataOps helps improve the speed of data & analytics deployment. The post DevOps Is Not DataOps first appeared on DataKitchen.

Data 98

Uber’s Fulfillment Platform: Ground-up Re-architecture to Accelerate Uber’s Go/Get Strategy

Uber Engineering

Introduction to Fulfillment at Uber. Uber’s mission is to help our consumers effortlessly go anywhere and get anything in thousands of cities worldwide. At its core, we capture a consumer’s intent and fulfill it by matching it with the right … The post Uber’s Fulfillment Platform: Ground-up Re-architecture to Accelerate Uber’s Go/Get Strategy appeared first on Uber Engineering Blog.

Create a Data Analysis Pipeline with Apache Kafka and RStudio

Confluent

In Data Science projects, we distinguish between descriptive analytics and statistical models running in production. Overall, these can be seen as one process. You start with analyzing historical data to […].

Reflecting on Cloudera’s Commitment to Address Workplace Inequality: One Year Later

Cloudera

It’s been a year of awakening and change across the U.S. and around the world. One year ago our CEO Rob Bearden vowed to take decisive action to make Cloudera a more diverse, equitable, and inclusive place to work and have Cloudera take an active role in promoting those attributes in the tech industry and our communities. There is no one-size-fits-all solution to creating an intentional and strategic plan for a diverse workforce.

Finance 122

Prepare Now: 2025’s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, and Terrence Sheflin

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Building a Multi-Tenant Managed Platform For Streaming Data With Pulsar at Datastax

Data Engineering Podcast

Summary Everyone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have still been high due to the level of technical understanding and operational capacity that have been required to run at scale.

Building 100

COVID-19 Pandemic Analytics for a Safe Return-To-Office

Teradata

Learn how Teradata is using Advanced Analytics to guide its safe Return-To-Office (RTO) policy for its global employees. Read more.

IT 98

Data Engineers of Netflix: Interview with Kevin Wylie

Netflix Tech

This post is part of our “Data Engineers of Netflix” series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix. Kevin Wylie is a Data Engineer on the Content Data Science and Engineering team. In this post, Kevin talks about his extensive experience in content analytics at Netflix since joining more than 10 years ago.

Elastic Distributed Training with XGBoost on Ray

Uber Engineering

Introduction. Since we productionized distributed XGBoost on Apache Spark™ at Uber in 2017, XGBoost has powered a wide spectrum of machine learning (ML) use cases at Uber, spanning from optimizing marketplace dynamic pricing policies for Freight, improving times of … The post Elastic Distributed Training with XGBoost on Ray appeared first on Uber Engineering Blog.

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

Protecting Data Integrity in Confluent Cloud: Over 8 Trillion Messages Per Day

Confluent

It’s about maintaining the right data even when no one is watching. Last year, Confluent announced support for Infinite Storage, which fundamentally changes data retention in Apache Kafka® by allowing […].

#ClouderaLife Spotlight: Vinicius Cardoso, Sr Solutions Engineering

Cloudera

Meet Vinicius Cardoso, better known as Vini. He is a Sr. Solutions Engineer (SE) working in Australia. In his role, customers are at the center of everything he does. Wearing the hat of Enterprise Architect, he dives deep to understand customers’ organizational goals, initiatives, and requirements in order to identify the key capabilities that need to be delivered.

Bringing The Metrics Layer To The Masses With Transform

Data Engineering Podcast

Summary Collecting and cleaning data is only useful if someone can make sense of it afterward. The latest evolution in the data ecosystem is the introduction of a dedicated metrics layer to help address the challenge of adding context and semantics to raw information. In this episode Nick Handel shares the story behind Transform, a new platform that provides a managed metrics layer for your data platform.

SQL 100

Recommender Systems: Behind the Scenes of Machine-Learning-Based Personalization

AltexSoft

Steve Jobs once said, “People don’t know what they want until you show it to them”. Well, try arguing that considering that we all watch videos suggested by YouTube, buy goods suggested by Amazon, and watch TV shows suggested by Netflix. People like being guided and given relevant offers and recommendations. They like being treated in a personal manner.

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.

Does Your Organization Need a Chief Data Officer? Probably

DataKitchen

The post Does Your Organization Need a Chief Data Officer? Probably first appeared on DataKitchen.

Data 90

Customer Support Automation Platform at Uber

Uber Engineering

High Level Overview of the Problem. Introduction. If you’ve used any online/digital service, chances are that you are familiar with what a typical customer service experience entails: you send a message (usually email aliased) to the company’s support staff, fill … The post Customer Support Automation Platform at Uber appeared first on Uber Engineering Blog.

Announcing ksqlDB 0.19.0

Confluent

We’re pleased to announce ksqlDB 0.19.0! This release includes a new NULLIF function and a major upgrade to ksqlDB’s data modeling capabilities—foreign-key joins. We’re excited to share this highly requested […].

Data 135

Accelerate Offloading to Cloudera Data Warehouse (CDW) with Procedural SQL Support

Cloudera

Did you know Cloudera customers, such as SMG and Geisinger, offloaded their legacy DW environment to Cloudera Data Warehouse (CDW) to take advantage of CDW’s modern architecture and best-in-class performance? In addition to substantial cost savings upon moving to CDW, Geisinger is also able to search through hundreds of millions of patient note records in seconds, providing better treatment to their patients.

The Ultimate Guide To Data-Driven Construction: Optimize Projects, Reduce Risks, & Boost Innovation

Speaker: Donna Laquidara-Carr, PhD, LEED AP, Industry Insights Research Director at Dodge Construction Network

In today’s construction market, owners, construction managers, and contractors must navigate increasing challenges, from cost management to project delays. Fortunately, digital tools now offer valuable insights to help mitigate these risks. However, the sheer volume of tools and the complexity of leveraging their data effectively can be daunting. That’s where data-driven construction comes in.

Strategies For Proactive Data Quality Management

Data Engineering Podcast

Summary Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue.

Data cleaning for nulls with SQL vs. code

Grouparoo

When preparing your data set for analysis, it is crucial to ensure that your data set is both complete and accurate. One step in this process is deciding how to handle null values. Depending on how your data is going to be used, you may not want null values at all! Let's clean some data. We're going to take a look at calculating the Lifetime Value (LTV) of a customer.
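To make the null-handling decision concrete, here is a small sketch of the "code" side of that comparison. It is not taken from the Grouparoo article: the function name `lifetime_value`, the `null_as_zero` flag, and the idea of representing purchases as a list of optional amounts are assumptions for illustration.

```python
from typing import Optional, Sequence


def lifetime_value(purchases: Sequence[Optional[float]],
                   null_as_zero: bool = True) -> Optional[float]:
    """Sum a customer's purchase amounts, with an explicit policy for nulls.

    Hypothetical helper: None stands in for a null purchase amount.
    """
    known = [amount for amount in purchases if amount is not None]
    has_nulls = len(known) != len(purchases)
    if has_nulls and not null_as_zero:
        # Strict policy: an LTV built on incomplete data is reported as unknown
        return None
    # Lenient policy: nulls contribute nothing to the total
    return sum(known)


print(lifetime_value([10.0, None, 5.0]))                      # 15.0
print(lifetime_value([10.0, None, 5.0], null_as_zero=False))  # None
```

Which policy is appropriate depends on how the resulting LTV figure will be used downstream.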

SQL 52

Knowledge Graph Technologies Accelerate and Improve the Data Model Definition for Master Data

Zalando Engineering

The Master Data Management Challenge. Master data management (MDM) is a technology-enabled discipline in which business and Information Technology work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise's official shared master data assets. At Zalando we are at an early phase of realising MDM for our internal data assets and we have chosen to do it in a consolidated style.

Categorizing user-uploaded documents

Scribd Technology

Scribd offers a variety of publisher and user-uploaded content to our users and while the publisher content is rich in metadata, user-uploaded content typically is not. Documents uploaded by the users have varied subjects and content types which can make it challenging to link them together. One way to connect content can be through a taxonomy - an important type of structured information widely used in various domains.

Driving Responsible Innovation: How to Navigate AI Governance & Data Privacy

Speaker: Aindra Misra, Senior Manager, Product Management (Data, ML, and Cloud Infrastructure) at BILL

Join us for an insightful webinar that explores the critical intersection of data privacy and AI governance. In today’s rapidly evolving tech landscape, building robust governance frameworks is essential to fostering innovation while staying compliant with regulations. Our expert speaker, Aindra Misra, will guide you through best practices for ensuring data protection while leveraging AI capabilities.

Speed, Scale, Storage: Our Journey from Apache Kafka to Performance in Confluent Cloud

Confluent

At Confluent, we focus on the holy trinity of performance, price, and availability, with the goal of delivering a similar performance envelope for all workloads across all supported cloud providers. […].

Cloud 122

Five Strategies to Accelerate Data Product Development

Cloudera

Introduction. With this first article of the two-part series on data product strategies, I am presenting some of the emerging themes in data product development and how they inform the prerequisites and foundational capabilities of an Enterprise data platform that would serve as the backbone for developing successful data product strategies. Once we have identified those capabilities, the second article explores how the Cloudera Data Platform delivers those prerequisite capabilities and has enab

Low Code And High Quality Data Engineering For The Whole Organization With Prophecy

Data Engineering Podcast

Summary There is a wealth of tools and systems available for processing data, but the user experience of integrating them and building workflows is still lacking. This is particularly important in large and complex organizations where domain knowledge and context are paramount and there may not be access to engineers for codifying that expertise. Raj Bains founded Prophecy to address this need by creating a UI-first platform for building and executing data engineering workflows that orchestrates

Building a Roadmap for Enterprise Data and Analytics – A Framework

Teradata

Building a data analytics roadmap for a large, complex enterprise can be daunting. Breaking it down into essentials helps manage complexity, avoid pitfalls, & set the program in the right direction.

What Is Entity Resolution? How It Works & Why It Matters

Entity resolution, sometimes referred to as data matching or fuzzy matching, is critical for data quality, analytics, graph visualization and AI. Learn what entity resolution is, why it matters, how it works and its benefits. Advanced entity resolution using AI is crucial because it efficiently and easily solves many of today’s data quality and analytics problems.