Sat. Feb 03, 2018 - Fri. Feb 09, 2018

Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17

Data Engineering Podcast

Summary: One of the critical components of modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream-oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss their work on Pulsar, which supports both models and is also fast and globally scalable.

Kafka 100

Concurrency, MySQL and Node.js: A journey of discovery

nodeSWAT

Our story begins, like so many others, with a code-loving protagonist, someone we can all relate to. His days are largely filled with designing code, writing code, and reading about code, keeping clients happy while learning and having fun. This has been going on for years with MySQL and Node.js, among other technologies, and so our protagonist considers himself quite proficient with both.

MySQL 52

Cybersecurity on Call: Nation-State Cyber Operations with Patrick Tucker

Cloudera

As cyber attacks continue to increase across the world, it has become more critical for countries to conduct both defensive and offensive cyber operations to protect national secrets and their citizens. An Arizona State University research paper showed just how global this problem is: its authors found that if hackers discussed a zero-day exploit on the dark web in Chinese, the likelihood of a hacker exploiting the vulnerability was 9%.

Cross-Lingual End-to-End Product Search with Deep Learning

Zalando Engineering

How We Built the Next Generation Product Search from Scratch using a Deep Neural Network. Product search is one of the key components in an online retail store. A good product search can understand a user’s query in any language, retrieve as many relevant products as possible, and finally present the results as a list in which the preferred products are at the top and the less relevant products are at the bottom.


Crushing AVRO Small Files with Spark

Zalando Engineering

Solving the many small files problem for AVRO. The Fashion Content Platform teams in Zalando Dublin handle large amounts of data on a daily basis. To make sense of it all, we utilise Hadoop (EMR) on AWS. In this post, we discuss a setup in which a real-time system feeds the data. Because data volumes vary, as do the intervals at which these systems write to storage, a large number of small files can result.

Hadoop 40