February, 2018

article thumbnail

Building Reliable Reprocessing and Dead Letter Queues with Apache Kafka

Uber Engineering

In distributed systems, retries are inevitable. From network errors to replication issues and even outages in downstream dependencies, services operating at a massive scale must be prepared to encounter, identify, and handle failure as gracefully as possible. Given the scope … The post Building Reliable Reprocessing and Dead Letter Queues with Apache Kafka appeared first on Uber Engineering Blog.

Kafka 109
article thumbnail

Honeycomb Data Infrastructure with Sam Stokes - Episode 20

Data Engineering Podcast

Summary One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, but it is still a critical component of a successful system. Sam Stokes is an engineer at Honeycomb where he helps to build a platform that is able to capture all of the events and context that occur in our production environments and use them to answer all of your questions abou

Kafka 100
article thumbnail

Zalando @ FOSDEM

Zalando Engineering

Why FOSDEM is not your average conference I could get cheeky with semantics and point out that the “M” in FOSDEM stands for “Meeting”. But I’ll play nice and focus instead on the specifics of the event itself. FOSDEM has been running since 2001. In that time, it has grown to become the open source community event for Europe. Over a two-day event, thousands of attendees descend upon the ULB in Brussels to attend what is, in reality, a collection of conferences.

article thumbnail

Concurrency, MySQL and Node.js: A journey of discovery

nodeSWAT

Our story begins like so many others with a code loving protagonist — someone we all can relate to. His days are largely filled with designing code, writing code and reading about code — keeping clients happy while learning and having fun. This has been going on for years now with both MySQL and Node.js among others and as such our protagonist considers himself quite proficient with both those technologies.

MySQL 52
article thumbnail

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

article thumbnail

Recap of Hadoop News for January 2018

ProjectPro

News on Hadoop - Janaury 2018 Apache Hadoop 3.0 goes GA, adds hooks for cloud and GPUs.TechTarget.com, January 3, 2018. The latest update to the 11 year old big data framework Hadoop 3.0 allows cluster pooling on GPU resources , reduces storage requirements, and adds a novel federation scheme that lets YARN resource manager and the job scheduler expand the number of nodes which can run within a Hadoop cluster.

Hadoop 52
article thumbnail

Breaking down data silos: when SAP alone is not enough

Cloudera

Running a large company is impossible without having an ERP system in place, and SAP business software remains at the forefront in this category. But when companies are looking towards new technologies such as data lakes, machine learning or predictive analytics, SAP alone is just not enough. To keep up with tech trends, businesses have to face the challenges of integrating SAP with non-SAP technologies and embark on a crusade against data silos.

More Trending

article thumbnail

Data Teams with Will McGinnis - Episode 19

Data Engineering Podcast

Summary The responsibilities of a data scientist and a data engineer often overlap and occasionally come to cross purposes. Despite these challenges it is possible for the two roles to work together effectively and produce valuable business outcomes. In this episode Will McGinnis discusses the opinions that he has gained from experience on how data teams can play to their strengths to the benefit of all.

article thumbnail

Data Analysis with Spark

Zalando Engineering

Apache’s lightning fast engine for data analysis and machine learning In recent years, there has been a massive shift in the industry towards data-oriented decision making backed by enormously large data sets. This means that we can serve our customers with more relevant, personalized content. We in the Digital Experience team are tasked with analysing Big Data in order to gather insights and support the product team with the decision making process.

article thumbnail

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

Data Engineering Podcast

Summary As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers.

article thumbnail

Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17

Data Engineering Podcast

Summary One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both options, in addition to being globally scalable and fast.

Kafka 100
article thumbnail

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

article thumbnail

Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17

Data Engineering Podcast

Summary One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both options, in addition to being globally scalable and fast.

Kafka 100
article thumbnail

2017 – Another Award-Winning Year for Cloudera!

Cloudera

In many ways, 2017 was a singular year for Cloudera, not least because we staged a successful IPO and joined the ranks of the world’s fastest-growing, publicly traded companies. We deeply appreciate the vote of confidence and trust our customers have placed in us and are proud of the hard work of our 1,600-plus employees. These are some of the year’s highlights.

article thumbnail

Cloudera on Cloudera: Our Journey to Becoming more Data-driven

Cloudera

I’ve spent the last four years here at Cloudera talking with our customers about how to run their businesses better using their data and Cloudera’s products and services. Now I get to put my money where my mouth is – and turn my focus internally on how we at Cloudera can become more data-driven. We aspire to and are on the journey to be the best-run company on data, and to be our own best reference.

article thumbnail

Cybersecurity on Call: Nation-State Cyber Operations with Patrick Tucker

Cloudera

As cyber attacks continue to increase across the world, it has become more critical for countries to implement cyber operations from a defensive and offensive perspective to protect national secrets and their citizens. An Arizona State University research paper showed just how global this problem is when they discovered that if hackers discussed a zero-day exploit on the dark web in Chinese the likelihood of a hacker exploiting the vulnerability was 9%.

article thumbnail

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

article thumbnail

Dave Shuman Talks IoT and Big Data on Federal News Radio

Cloudera

What exactly can we expect for IoT in 2018, and how can you improve your organization with connected devices? That was the question Dave Shuman set out to answer when he sat down last month with John Gilroy at the Federal News Radio headquarters in Washington, D.C. Federal Tech Talk looks at the world of high technology in the federal government and, as its host, John speaks the language of federal CISOs, CIOs, and CTOs.

article thumbnail

Innovation in Digital Experience

Zalando Engineering

Multi-functional teams make for a greater customer journey When I started in Zalando Tech, I hadn’t worked with a product manager before, and I had probably never seen a UX designer, a UI designer, a researcher or a business developer before either. My world was data science, more specifically, personalization and recommender systems. In this isolated bubble, data scientists often thought we could solve all problems without help, but in the last two years, I came to understand why we need to sto

article thumbnail

Cross-Lingual End-to-End Product Search with Deep Learning

Zalando Engineering

How We Built the Next Generation Product Search from Scratch using a Deep Neural Network Product search is one of the key components in an online retail store. A good product search can understand a user’s query in any language, retrieve as many relevant products as possible, and finally present the results as a list in which the preferred products should be at the top, and the less relevant products should be at the bottom.

article thumbnail

Crushing AVRO Small Files with Spark

Zalando Engineering

Solving the many small files problem for AVRO The Fashion Content Platform teams in Zalando Dublin handle large amounts of data on a daily basis. To make sense of it all, we utilise Hadoop (EMR) on AWS. Within this post, we discuss a system where a real-time system feeds the data. Due to the variance in data volumes and the period that these systems write to storage, there can be a large number of small files.

Hadoop 40
article thumbnail

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

article thumbnail

Five Minutes from Machine Learning to RESTful API

Zalando Engineering

The benefits of Connexion: Zalando’s open source API-First framework In this article, I will show how quick and simple it can be to create a RESTful API for a machine learning model using Zalando’s open source Swagger/OpenAPI First framework called Connexion. Official documentation describes Connexion as the following: “Connexion is a framework on top of Flask that automagically handles HTTP requests based on OpenAPI 2.0 Specification (formerly known as Swagger Spec) of your API described in YAM