May, 2021

article thumbnail

How to make data pipelines idempotent

Start Data Engineering

What is an idempotent function Pre-requisites Why idempotency matters Making your data pipeline idempotent Conclusion Further reading References What is an idempotent function “Idempotence is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application” - wikipedia Defined as f(f(x)) = f(x) In the data engineering context, this can come to mean that: running a data pipeline

article thumbnail

The Architecture of Uber’s API gateway

Uber Engineering

API gateways are an integral part of microservices architecture in recent years. An API gateway provides a single point of entry for all our apps and provides an interface to access data, logic, or functionality from back-end microservices. It also … The post The Architecture of Uber’s API gateway appeared first on Uber Engineering Blog.

article thumbnail

NVIDIA RAPIDS in Cloudera Machine Learning

Cloudera

Introduction. In the previous blog post in this series, we walked through the steps for leveraging Deep Learning in your Cloudera Machine Learning (CML) projects. This year, we expanded our partnership with NVIDIA , enabling your data teams to dramatically speed up compute processes for data engineering and data science workloads with no code changes using RAPIDS AI.

article thumbnail

Introducing Confluent for Kubernetes

Confluent

We are excited to announce that Confluent for Kubernetes is generally available! Today, we are enabling our customers to realize many of the benefits of our cloud service with the […].

Cloud 137
article thumbnail

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

article thumbnail

Paving The Road For Fast Analytics On Distributed Clouds With The Yellowbrick Data Warehouse

Data Engineering Podcast

Summary The data warehouse has become the focal point of the modern data platform. With increased usage of data across businesses, and a diversity of locations and environments where data needs to be managed, the warehouse engine needs to be fast and easy to manage. Yellowbrick is a data warehouse platform that was built from the ground up for speed, and can work across clouds and all the way to the edge.

article thumbnail

What’s the Secret Recipe for DataOps?

DataKitchen

Catalog & Cocktails podcast hosts Tim Gasper & Juan Sequeda of data.world interview DataKitchen CEO Chris Bergh on how to create the right DataOps culture & measuring the value of your DataOps strategy. The post What’s the Secret Recipe for DataOps? first appeared on DataKitchen.

98

More Trending

article thumbnail

My (Seemingly) Random Walk to Netflix

Netflix Tech

Part of our series on who works in Analytics at Netflix?—?and what the role entails By Sean Barnes, Studio Production Data Science & Engineering I am going to tell you a story about a person that works for Netflix. That person grew up dreaming of working in the entertainment industry. They attended the University of Southern California, double majored in data science and television & film production, and graduated summa cum laude.

article thumbnail

Spark on Kubernetes – Gang Scheduling with YuniKorn

Cloudera

Apache YuniKorn (Incubating) has just released 0.10.0 ( release announcement ). As part of this release, a new feature called Gang Scheduling has become available. By leveraging the Gang Scheduling feature, Spark jobs scheduling on Kubernetes becomes more efficient. What is Apache YuniKorn (Incubating)? Apache YuniKorn (Incubating) is a new Apache incubator project that offers rich scheduling capabilities on Kubernetes.

Metadata 136
article thumbnail

Confluent CLI Launches Exciting New Features and an Intuitive UI

Confluent

With so many technologies in the modern development ecosystem, a common complaint is having to go through the mental gymnastics of adopting new products and keeping up with ever-expanding feature […].

article thumbnail

Easily Build Advanced Similarity Search With The Pinecone Vector Database

Data Engineering Podcast

Summary Machine learning models use vectors as the natural mechanism for representing their internal state. The problem is that in order for the models to integrate with external systems their internal state has to be translated into a lower dimension. To eliminate this impedance mismatch Edo Liberty founded Pinecone to build database that works natively with vectors.

Database 100
article thumbnail

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

article thumbnail

Data Observability and Monitoring with DataOps

DataKitchen

Data errors impact decision-making. When analytics and dashboards are inaccurate, business leaders may not be able to solve problems and pursue opportunities. Data errors infringe on work-life balance. They cause people to work long hours at the expense of personal and family time. Data errors also affect careers. If you have been in the data profession for any length of time, you probably know what it means to face a mob of stakeholders who are angry about inaccurate or late analytics.

article thumbnail

Big Data Analytics: How It Works, Tools, and Real-Life Applications

AltexSoft

Big Data enjoys the hype around it and for a reason. But the understanding of the essence of Big Data and ways to analyze it is still blurred. The truth is, there’s more to this term than just the size of information generated. Not only does Big Data apply to the huge volumes of continuously growing data that come in different formats, but it also refers to the range of processes, tools, and approaches used to gain insights from that data.

article thumbnail

Netflix Drive

Netflix Tech

A file and folder interface for Netflix Cloud Services Written by Vikram Krishnamurthy , Kishore Kasi , Abhishek Kapatkar , and Tejas Chopra In this post, we are introducing Netflix Drive, a Cloud drive for media assets and providing a high level overview of some of its features and interfaces. We intend this to be a first post in a series of posts covering Netflix Drive.

article thumbnail

The Ethics of AI Comes Down to Conscious Decisions

Cloudera

This blog post was written by Pedro Pereira as a guest author for Cloudera. . Right now, someone somewhere is writing the next fake news story or editing a deepfake video. An authoritarian regime is manipulating an artificial intelligence (AI) system to spy on technology users. No matter how good the intentions behind the development of a technology, someone is bound to corrupt and manipulate it.

Algorithm 120
article thumbnail

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

article thumbnail

Error Handling Patterns for Apache Kafka Applications

Confluent

Apache Kafka® applications run in a distributed manner across multiple containers or machines. And in the world of distributed systems, what can go wrong often goes wrong. This blog post […].

Kafka 128
article thumbnail

A Holistic Approach To Data Governance Through Self Reflection At Collibra

Data Engineering Podcast

Summary Data governance is a phrase that means many different things to many different people. This is because it is actually a concept that encompasses the entire lifecycle of data, across all of the people in an organization who interact with it. Stijn Christiaens co-founded Collibra with the goal of addressing the wide variety of technological aspects that are necessary to realize such an important and expansive process.

article thumbnail

Thirteen Thoughts About the Data Mesh

Teradata

The concept of Data Mesh is abuzz in the industry right now. Find out why we're so enthusiastic about it.

Data 72
article thumbnail

Computer Vision in Healthcare: Creating an AI Diagnostic Tool for Medical Image Analysis

AltexSoft

Our lungs are the only body organs that constantly interact with the external environment, through the air we breathe. This exposure makes the respiratory system extremely susceptible to a wide range of diseases, from long-familiar asthma to novel COVID-19. Subtle at early stages, the signs of lung conditions are easy to overlook. And delays in diagnosis often lead to harsh consequences.

Medical 72
article thumbnail

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

article thumbnail

Achieving observability in async workflows

Netflix Tech

Written by Colby Callahan , Megha Manohara , and Mike Azar. Managing and operating asynchronous workflows can be difficult without the proper tools and architecture that puts observability, debugging, and tracing at the forefront. Imagine getting paged outside normal work hours?—?users are having trouble with the application you’re responsible for, and you start diving into logs.

Java 67
article thumbnail

Automating CDP Private Cloud Installations with Ansible

Cloudera

The introduction of CDP Public Cloud has dramatically reduced the time in which you can be up and running with Cloudera’s latest technologies, be it with containerised Data Warehouse , Machine Learning , Operational Database or Data Engineering experiences or the multi-purpose VM-based Data Hub style of deployment. In CDP Private Cloud, the introduction of Cloudera Data Warehouse and Cloudera Machine Learning Experiences on RedHat OpenShift Kubernetes clusters means that we can deploy new

Cloud 106
article thumbnail

Kafka Summit Europe 2021 Recap

Confluent

And that’s a wrap on Kafka Summit Europe 2021, the first of three global Kafka Summits this year. We’ve seen 17,000 registrations from over 7,000 companies and 137 different countries. […].

Kafka 123
article thumbnail

Unlocking The Power of Data Lineage In Your Platform with OpenLineage

Data Engineering Podcast

Summary Data lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed you need to track the lineage of the data from beginning to end. The complicating factor is that every framework, platform, and product has its own concepts of how to store, represent, and expose that information.

Metadata 100
article thumbnail

Prepare Now: 2025s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

article thumbnail

How DataOps Enables a Data Fabric

DataKitchen

The post How DataOps Enables a Data Fabric first appeared on DataKitchen.

Data 66
article thumbnail

Intelligent Document Processing: Technology Overview

AltexSoft

Whatever the industry, various documents accompany at least a quarter of business operations. Healthcare, for example, is filled with millions of patient records and medical forms. As far as transportation, these can be maintenance and driver logs. The documents often come in semi-structured and unstructured data formats, which makes them difficult to process quickly and accurately.

article thumbnail

Beyond Resilience-The Next Generation of Supply Chain

Teradata

After the shock of COVID exposed the brittle nature of many global supply chains, focus has shifted to resilience, a necessary consideration but not the only one.

64
article thumbnail

Streaming Market Data with Flink SQL Part II: Intraday Value-at-Risk

Cloudera

This article is the second in a multipart series to showcase the power and expressibility of FlinkSQL applied to market data. In case you missed it, part I starts with a simple case of calculating streaming VWAP. Code and data for this series are available on github. Speed matters in financial markets. Whether the goal is to maximize alpha or minimize exposure, financial technologists invest heavily in having the most up-to-date insights on the state of the market and where it is going.

SQL 104
article thumbnail

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

article thumbnail

Using kafka-merge-purge to Deal with Failure in an Event-Driven System at FLYERALARM

Confluent

Failures are inevitable in any system, and there are various options for mitigating them automatically. This is made possible by event-driven applications leveraging Apache Kafka® and built with fault tolerance […].

Kafka 95
article thumbnail

Building Your Data Warehouse On Top Of PostgreSQL

Data Engineering Podcast

Summary There is a lot of attention on the database market and cloud data warehouses. While they provide a measure of convenience, they also require you to sacrifice a certain amount of control over your data. If you want to build a warehouse that gives you both control and flexibility then you might consider building on top of the venerable PostgreSQL project.

article thumbnail

Asynchronous APIs in CRM and marketing tools

Grouparoo

When integrating with Destinations , there are generally two main approaches made available by API providers: single or batched. With the "single" approach, one API request usually affects a single profile in the destination. The "batched" approach, which you can read more about here , allows you to affect multiple profiles in a single API request.

Process 52
article thumbnail

Data Transformations Using the Data Build Tool

Ripple Engineering

At Ripple , we are moving towards building complex business models out of raw data. To do this successfully, we need to automate our historically manual processes. Even with a digital-first approach, many of our internal processes were done by hand, making them great candidates to be automated. A prime example of this was the process of managing our data transformation workflows.

article thumbnail

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.