October, 2020


Top 5 Things Every Kafka Developer Should Know

Confluent

Apache Kafka® is an event streaming platform used by more than 30% of the Fortune 500 today. There are numerous features of Kafka that make it the de facto standard for […].

Kafka 145

UK Government: From cloud first to cloud appropriate?

Cloudera

Since 2013 the UK Government’s flagship ‘Cloud First’ policy has been at the forefront of enabling departments to shed their legacy IT architecture in order to meaningfully embrace digital transformation. The policy outlines that the cloud (and specifically, public cloud) be the default position for any new services, unless it can be demonstrated that other alternatives offer better value for money.


How to submit Spark jobs to EMR cluster from Airflow

Start Data Engineering

Introduction: I have been asked, and have seen others ask, questions such as "How are others automating Apache Spark jobs on EMR?" and "How do I submit Spark jobs to an EMR cluster from Airflow?"

Cloud 130

Data: The Crumbling Foundation of Finance, Our Once Trusted Advisor

Teradata

The most frequently asked question of Finance departments today is, ‘whose data do we trust’? Here’s how to ensure Finance always has the correct answer.

Finance 117

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.


Netflix Android and iOS Studio Apps - now powered by Kotlin Multiplatform

Netflix Tech

By David Henry & Mel Yahya. Over the last few years Netflix has been developing a mobile app called Prodicle to innovate in the physical production of TV shows and movies. The world of physical production is fast-paced, and needs vary significantly by country, by region, and even from one production to the next.

Coding 112

Cloud Native Data Security As Code With Cyral

Data Engineering Podcast

Summary: One of the most challenging aspects of building a data platform has nothing to do with pipelines and transformations. If you are putting your workflows into production, then you need to consider how you are going to implement data security, including access controls and auditing. Different databases and storage systems all have their own method of restricting access, and they are not all compatible with each other.

More Trending


Data security vs usability: you can have it all

Cloudera

Growing up, were you ever told you can’t have it all? That you can’t eat all the snacks in one sitting? That you can’t watch the complete Back to the Future trilogy as well as study for your science exam in one evening? Over time, we learn to set priorities, make a decision for one thing over the other, and compromise. Just like when it comes to data access in business.


How Grouparoo works as a team

Grouparoo

When Brian, Evan, and I first talked about starting a company, we already had some ideas in mind about what we might want to do differently from our past roles. The three of us had all worked together before at TaskRabbit, but since we were starting a brand new company, we decided to approach how we would work from first principles. I thought we’d share some tidbits about how we work right now.


Why the Single Source of Truth Paradigm in Data Warehousing is Outdated

Teradata

The old paradigm of the data warehouse serving as the single source of truth in today's ever evolving data landscape can no longer be sustained. Find out why.


Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores

Netflix Tech

By Tianlong Chen and Ioannis Papapanagiotou. Netflix has more than 195 million subscribers who generate petabytes of data every day. Data scientists and engineers collect this data from our subscribers and videos, and implement data analytics models to discover customer behaviour with the goal of maximizing user joy. Usually data scientists and engineers write Extract-Transform-Load (ETL) jobs and pipelines using big data compute technologies, like Spark or Presto, to process this data and perio


Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!


Better Data Quality Through Observability With Monte Carlo

Data Engineering Podcast

Summary: In order for analytics and machine learning projects to be useful, they require a high degree of data quality. To ensure that your pipelines are healthy, you need a way to make them observable. In this episode Barr Moses and Lior Gavish, co-founders of Monte Carlo, share the leading causes of what they refer to as data downtime and how it manifests.


Introducing Confluent Platform 6.0

Confluent

Each month, we’ve announced a set of Confluent features organized around what we think are the key foundational traits of cloud-native data systems as part of Project Metamorphosis. Data systems […].

Project 143

How-to: Index Data from S3 via NiFi Using CDP Data Hubs

Cloudera

About this Blog. Data Discovery and Exploration (DDE) was recently released in tech preview in Cloudera Data Platform in public cloud. In this blog we will go through the process of indexing data from S3 into Solr in DDE with the help of NiFi in Data Flow. The scenario is the same as it was in the previous blog but the ingest pipeline differs. Spark as the ingest pipeline tool for Search (i.e.

AWS 121

Akka Typed: How the Pipe Pattern Prevents Anti-Patterns

Rock the JVM

Discover how Akka Typed revolutionizes actor protocol definitions and dramatically enhances actor mechanics


Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.


How to Prioritize "Self" in Today's World: A Summary on Mental Health

Teradata

In honor of World Mental Health Day this past weekend, Shehzeen Rehman writes on the importance of de-stigmatizing mental health and learning how to seek help.


Building Netflix’s Distributed Tracing Infrastructure

Netflix Tech

By Maulik Pandey. Our team: Kevin Lew, Narayanan Arunachalam, Elizabeth Carretto, Dustin Haffner, Andrei Ushakov, Seth Katz, Greg Burrell, Ram Vaithilingam, Mike Smith and Maulik Pandey. “@Netflixhelps Why doesn’t Tiger King play on my phone?” (a Netflix member via Twitter). This is an example of a question our on-call engineers need to answer to help resolve a member issue.


Rapid Delivery Of Business Intelligence Using Power BI

Data Engineering Podcast

Summary: Business intelligence efforts are only as useful as the outcomes that they inform. Power BI aims to reduce the time and effort required to go from information to action by providing an interface that encourages rapid iteration. In this episode Rob Collie shares his enthusiasm for the Power BI platform and how it stands out from other options.


Preparing Your Clients and Tools for KIP-500: ZooKeeper Removal from Apache Kafka

Confluent

As described in the blog post Apache Kafka® Needs No Keeper: Removing the Apache ZooKeeper Dependency, when KIP-500 lands next year, Apache Kafka will replace its usage of Apache ZooKeeper […].

Kafka 140

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?


Apache Spark on Kubernetes: How Apache YuniKorn (Incubating) helps

Cloudera

Background: Why choose K8s for Apache Spark? Apache Spark unifies batch processing, real-time processing, stream analytics, machine learning, and interactive query in one platform. While Apache Spark provides a lot of capabilities to support diversified use cases, it comes with additional complexity and high maintenance costs for cluster administrators.

Big Data 121

Akka Typed Actors: Stateful and Stateless

Rock the JVM

Akka Typed has transformed actor creation: in this article, we explore various methods for managing state within Akka actors


Watch Out for Gotchas in Cloud Data Warehouse Pricing

Teradata

Successful companies need to squeeze maximum value from all of their data & do it at the lowest possible cost. But they often get hit with unexpected budget overruns. Teradata can help.


A Day in the Life of a Content Analytics Engineer

Netflix Tech

Part of our series on who works in Analytics at Netflix, and what the role entails. By Rocio Ruelas. Back when we were all working in offices, my favorite days were Monday, Wednesday, and Friday. Those were the days with the best hot breakfast, and I’ve always been a sucker for free food. I started the day by arriving at the LA office right before 8am and finding a parking spot close to the entrance.


Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.


Self Service Real Time Data Integration Without The Headaches With Meroxa

Data Engineering Podcast

Summary: Analytical workloads require a well-engineered and well-maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data warehouses is a non-trivial effort, requiring a substantial investment of time and energy. Meroxa is a new platform that aims to automate the heavy lifting of change data capture, monitoring, and data loading.


How Real-Time Materialized Views Work with ksqlDB, Animated

Confluent

All around the world, companies are asking the same question: What is happening right now? We are inundated with pieces of data that have a fragment of the answer. But […].

Data 124

What you need to know to begin your journey to CDP

Cloudera

Recently, my colleague published a blog, "Build on your investment by Migrating or Upgrading to CDP Data Center", which articulates great CDP Private Cloud Base features. Existing CDH and HDP customers can immediately benefit from this new functionality. This blog focuses on the process to accelerate your journey to CDP Private Cloud Base for both professional services engagements and self-service upgrades.


Fullstack Typescript - create an API

Grouparoo

Two of the major components of the @grouparoo/core application are a Node.js API server and a React frontend. We use Actionhero as the API server, and Next.js for our React site generator. As we develop the Grouparoo application, we are constantly adding new API endpoints and changing existing ones. One of the great features of TypeScript is that it can help not only to share type definitions within a codebase, but also across multiple codebases or services.


How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.


Announcing Vantage on Google Cloud

Teradata

Teradata Vantage on Google Cloud is now generally available! Vantage on Google Cloud is an as-a-service offer in which customers can get the most analytic value from their data. Read more.


Broadcast Joins in Apache Spark: An Optimization Technique

Rock the JVM

Broadcast joins in Apache Spark are a highly effective technique for boosting performance and avoiding memory issues, offering great value for optimization


The Curse of Dimensionality

Domino Data Lab: Data Engineering

Danger of Big Data: Big data is the rage. This could be lots of rows (samples) and few columns (variables) like credit card transaction data, or lots of columns (variables) and few rows (samples) like genomic sequencing in life sciences research. The Curse of Dimensionality, or Large P, Small N (P >> N) problem, applies to the latter case of lots of variables measured on relatively few samples.


ksqlDB Meets Java: An IoT-Inspired Demo of the Java Client for ksqlDB

Confluent

Stream processing applications, including streaming ETL pipelines, materialized caches, and event-driven microservices, are made easy with ksqlDB. Until recently, your options for interacting with ksqlDB were limited to its command-line […].

Java 122

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.