Sat.Sep 18, 2021 - Fri.Sep 24, 2021

article thumbnail

What’s New in Apache Kafka 3.0.0

Confluent

I’m pleased to announce the release of Apache Kafka 3.0 on behalf of the Apache Kafka® community. Apache Kafka 3.0 is a major release in more ways than one. Apache […].

Kafka 145
article thumbnail

Real-Time Exactly-Once Ad Event Processing with Apache Flink, Kafka, and Pinot

Uber Engineering

Uber recently launched a new capability: Ads on UberEats. With this new ability came new challenges that needed to be solved at Uber, such as systems for ad auctions, bidding, attribution, reporting, and more. This article focuses on how we … The post Real-Time Exactly-Once Ad Event Processing with Apache Flink, Kafka, and Pinot appeared first on Uber Engineering Blog.

Kafka 135
article thumbnail

Airflow Trigger Rules: All you need to know!

Marc Lamberti

By default, your tasks get executed once all the parent tasks succeed. this behaviour is what you expect in general. But what if you want something more complex? What if you would like to execute a task as soon as one of its parents succeeds? Or maybe you would like to execute a different set of tasks if a task fails? Or act differently according to if a task succeeds, fails or event gets skipped?

article thumbnail

Apache Kafka Deployments and Systems Reliability – Part 1

Cloudera

There are many ways that Apache Kafka has been deployed in the field. In our Kafka Summit 2021 presentation, we took a brief overview of many different configurations that have been observed to date. In this blog series, we will discuss each of these deployments and the deployment choices made along with how they impact reliability. In Part 1, the discussion is related to: Serial and Parallel Systems Reliability as a concept, Kafka Clusters with and without Co-Located Apache Zookeeper, and Kafka

Kafka 121
article thumbnail

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

article thumbnail

Announcing ksqlDB 0.21.0

Confluent

We’re pleased to announce ksqlDB 0.21.0! This release includes a major upgrade to ksqlDB’s foreign-key joins, the new data type BYTES, and a new ARRAY_CONCAT function. All of these features […].

Bytes 140
article thumbnail

Massively Parallel Data Processing In Python Without The Effort Using Bodo

Data Engineering Podcast

Summary Python has beome the de facto language for working with data. That has brought with it a number of challenges having to do with the speed and scalability of working with large volumes of information.There have been many projects and strategies for overcoming these challenges, each with their own set of tradeoffs. In this episode Ehsan Totoni explains how he built the Bodo project to bring the speed and processing power of HPC techniques to the Python data ecosystem without requiring any

More Trending

article thumbnail

Supercharge your Airflow Pipelines with the Cloudera Provider Package

Cloudera

Many customers looking at modernizing their pipeline orchestration have turned to Apache Airflow, a flexible and scalable workflow manager for data engineers. With 100s of open source operators, Airflow makes it easy to deploy pipelines in the cloud and interact with a multitude of services on premise, in the cloud, and across cloud providers for a true hybrid architecture. .

Python 105
article thumbnail

What is an A/B Test?

Netflix Tech

Martin Tingley with Wenjing Zheng , Simon Ejdemyr , Stephanie Lane , and Colin McFarland This is the second post in a multi-part series on how Netflix uses A/B tests to inform decisions and continuously innovate on our products. See here for Part 1: Decision Making at Netflix. Subsequent posts will go into more details on the statistics of A/B tests, experimentation across Netflix, how Netflix has invested in infrastructure to support and scale experimentation, and the importance of the culture

article thumbnail

An Exploration Of The Data Engineering Requirements For Bioinformatics

Data Engineering Podcast

Summary Biology has been gaining a lot of attention in recent years, even before the pandemic. As an outgrowth of that popularity, a new field has grown up that pairs statistics and compuational analysis with scientific research, namely bioinformatics. This brings with it a unique set of challenges for data collection, data management, and analytical capabilities.

article thumbnail

Start DataOps Today with ‘Lean DataOps’

DataKitchen

Data organizations don’t always have the budget or schedule required for DataOps when conceived as a top-to-bottom, enterprise-wide transformational change. An essential part of the DataOps methodology is Agile Development , which breaks development into incremental steps. DataOps can and should be implemented in small steps that complement and build upon existing workflows and data pipelines.

article thumbnail

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

article thumbnail

Telecom Network Analytics: Transformation, Innovation, Automation

Cloudera

One of the most substantial big data workloads over the past fifteen years has been in the domain of telecom network analytics. Where does it stand today? What are its current challenges and opportunities? In a sense, there have been three phases of network analytics: the first was an appliance based monitoring phase; the second was an open-source expansion phase; and the third – that we are in right now – is a hybrid-data-cloud and governance phase.

article thumbnail

Netflix Cloud Packaging in the Terabyte Era

Netflix Tech

By Xiaomei Liu , Rosanna Lee , Cyril Concolato Introduction Behind the scenes of the beloved Netflix streaming service and content, there are many technology innovations in media processing. Packaging has always been an important step in media processing. After content ingestion, inspection and encoding, the packaging step encapsulates encoded video and audio in codec agnostic container formats and provides features such as audio video synchronization, random access and DRM protection.

Cloud 97
article thumbnail

Declarative Machine Learning Without The Operational Overhead Using Continual

Data Engineering Podcast

Summary Building, scaling, and maintaining the operational components of a machine learning workflow are all hard problems. Add the work of creating the model itself, and it’s not surprising that a majority of companies that could greatly benefit from machine learning have yet to either put it into production or see the value. Tristan Zajonc recognized the complexity that acts as a barrier to adoption and created the Continual platform in response.

article thumbnail

Data Warehousing Basiscs

Data Science Blog: Data Engineering

Data Warehousing is applied Big Data Management and a key success factor in almost every company. Without a data warehouse, no company today can control its processes and make the right decisions on a strategic level as there would be a lack of data transparency for all decision makers. Bigger comanies even have multiple data warehouses for different purposes.

article thumbnail

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

article thumbnail

Speed Up Your Data Flow for Business Results

Cloudera

A slow car has never won a Formula One race. The Olympics doesn’t reward slow times in swimming, track or any other clock-timed sport. Likewise, slow data speeds don’t win over customers or colleagues in the real-time business world. Microsoft’s own research once reported that a person visiting a website on a connected device is likely to wait no more than 10 seconds to see it before moving to a competitor’s site.

Data 87
article thumbnail

Datakin is now open to all!

Datakin

Blog Datakin is now open to all! Written by Laurent Paris on Sep 24, 2021 This is it! We’re officially out of beta and excited to announce the general availability of Datakin. Our story began with the creation of Marquez over two years ago. We believed then, and still believe now, that a new approach to data lineage was essential to support today’s pipelines.

article thumbnail

6 Automated Data Capture Methods For Business Development

InData Labs

Today, digitization penetrates all spheres of business. 2.5 quintillion bytes of data that people create every day is predominantly unstructured data. Whether it is audio, video or text, big data – if meticulously collected, recognized, and processed – can generate business value through leveraging state-of-the-art technologies. But no matter how intelligent machines may be, they.

Bytes 52
article thumbnail

Data Observability: Five Quick Ways to Improve the Reliability of Your Data

Monte Carlo

If your data breaks, does it make a sound? Odds are, the answer is yes. But will you hear it? Probably not. Nowadays, organizations ingest large amounts of data across increasingly complex ecosystems, and very often their data breaks silently, and as a result data teams are left in the dark – until it’s too late. But, if said data is a report used by your Chief Revenue Officer to determine next quarter’s forecast, chances are this data will make a very, very large sound.

Data 52
article thumbnail

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

article thumbnail

Partnerships that Enrich Solutions: a Spotlight Interview with Dell Enterprise Germany’s General Manager, Benjamin Krebs

Cloudera

During this Partner Perspective interview, Cloudera’s Alvin Heib seizes the opportunity to speak with Benjamin Krebs, General Manager of Technology Enterprise in Germany. The pair discuss Benjamin’s role at Dell, the importance of partnerships in his region, how the pandemic has altered Dell’s working landscape and finally, some predictions Benjamin has on Dell’s future.

article thumbnail

What is Data Hub: Purpose, Architecture Patterns, and Existing Solutions Overview

AltexSoft

The larger the company, the more data it has to generate actionable insights. Yet, more than often, businesses can’t make use of their most valuable asset — information. Why? Because it is scattered across disparate systems, hardly available for analytical apps. Evidently, common storage solutions fail to provide a unified data view and meet the needs of companies for seamless data flow.

article thumbnail

Event Streaming in Apache Pulsar with Scala

Rock the JVM

Apache Pulsar is a cloud-native, distributed messaging and streaming platform handling hundreds of billions of events daily: discover its strengths and see how to use Scala with the pulsar4s client library to interact with it

Scala 52
article thumbnail

Bob Muglia, former Snowflake CEO, to Speak at IMPACT, the World’s First Data Observability Summit

Monte Carlo

Today, we’re thrilled to announce that Bob Muglia , entrepreneur, Fivetran board member, and former CEO of Snowflake, and DJ Patil, the first U.S. Chief Data Scientist, will speak at IMPACT: The Data Observability Summit. Muglia’s fireside chat with Monte Carlo CEO Barr Moses will cap off the event, and touch on such topics as the rise of data in the cloud, challenges and opportunities in the current tooling landscape, and his vision for the future of data engineering and analytics.

article thumbnail

Prepare Now: 2025s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

article thumbnail

AWS Kinesis Firehose and Teradata Vantage

Teradata

Many Teradata customers are interested in integrating Vantage with Amazon AWS First Party Services. This Getting Started Guide will help you to connect Vantage with AWS Kinesis service.

AWS 52
article thumbnail

The Data Janitor Letters - August 2021

Pipeline Data Engineering

Data engineering salon. News and interesting reads about the world of data. From Data Driven to Driving Data — The dysfunctions of Data Engineering MrTrustworthy Many “data driven” initiatives are failing even though they had the best engineers on the task and picked the “best” stack of technologies. What's an OLAP cube? ? Claire Carroll, Analytics Engineer, analyticsengineers.club OLAP cubes were this intimidating concept, and the more they read, the less they understood, but it turns out that

Hadoop 52
article thumbnail

Custom Pattern Matching in Scala

Rock the JVM

Pattern matching is one of Scala's most powerful features: discover how to customize it and create your own patterns in this article

Scala 52
article thumbnail

How We Improved the Concurrency and Scalability of Our Redis Rate Limiting System

Rockset

Background Rate limiting is a technique used to protect services from overload. In addition, it can be used to prevent starvation of a multi-tenant resource by a few very large customers. At Rockset, we primarily use rate limiting to protect our: metadata store from overload caused by too many API requests. log store from filling up due to mismatched input and output rates control plane from too many state transitions.

Systems 52
article thumbnail

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

article thumbnail

Tracing SRE’s journey in Zalando - Part II

Zalando Engineering

Welcome to the second part of our journey establishing SRE in Zalando. You’ll find the first part here. Don’t miss out on the third and final post in one week. 2018 - The Return of SRE In our previous blog post we left it with the plans for Site Reliability Engineering (SRE) in Zalando having to change. So, what were those changes and what were the challenges we faced in this new iteration?

article thumbnail

Flexibility and Resiliency Across the Supply Chain

Teradata

The supply chain is not just the sum of its parts. Each function, organization, decision & action are connected & have an effect on each part of the supply chain. Find out more.

IT 52
article thumbnail

Streaming Events From Salesforce for Lead Enrichment With RudderStack’s Webhook Source

RudderStack

How to use a webhook to stream new ‘lead created’ events from Salesforce through Rudderstack for lead enrichment w/ Clearbit data then back to Salesforce.

Data 40
article thumbnail

ACID vs BASE Concepts

Data Science Blog: Data Engineering

Understanding databases for storing, updating and analyzing data requires the understanding of two concepts: ACID and BASE. This is the first article of the article series Data Warehousing Basics. The properties of ACID are being applied for databases in order to fulfill enterprise requirements of reliability and consistency. ACID is an acronym, and stands for: Atomicity – Each transaction is either properly executed completely or does not happen at all.

NoSQL 52
article thumbnail

The Cloud Development Environment Adoption Report

Cloud Development Environments (CDEs) are changing how software teams work by moving development to the cloud. Our Cloud Development Environment Adoption Report gathers insights from 223 developers and business leaders, uncovering key trends in CDE adoption. With 66% of large organizations already using CDEs, these platforms are quickly becoming essential to modern development practices.