February, 2018

Honeycomb Data Infrastructure with Sam Stokes - Episode 20

Data Engineering Podcast

Summary: One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, but it is still a critical component of a successful system. Sam Stokes is an engineer at Honeycomb where he helps to build a platform that is able to capture all of the events and context that occur in our production environments and use them to answer all of your questions abou

Kafka 100

Code Migration in Production: Rewriting the Sharding Layer of Uber’s Schemaless Datastore

Uber Engineering

In 2014, Uber Engineering built Schemaless, our fault-tolerant and scalable datastore, to facilitate the rapid growth of our company. For context, we deployed more than 40 Schemaless instances and many thousands of storage nodes in 2016 alone. As our … The post Code Migration in Production: Rewriting the Sharding Layer of Uber’s Schemaless Datastore appeared first on Uber Engineering Blog.

Coding 93

Trending Sources

Zalando @ FOSDEM

Zalando Engineering

Why FOSDEM is not your average conference

I could get cheeky with semantics and point out that the “M” in FOSDEM stands for “Meeting”. But I’ll play nice and focus instead on the specifics of the event itself. FOSDEM has been running since 2001. In that time, it has grown to become the open source community event for Europe. Over a two-day event, thousands of attendees descend upon the ULB in Brussels to attend what is, in reality, a collection of conferences.

Concurrency, MySQL and Node.js: A journey of discovery

nodeSWAT

Our story begins, like so many others, with a code-loving protagonist, someone we all can relate to. His days are largely filled with designing code, writing code and reading about code, keeping clients happy while learning and having fun. This has been going on for years now with both MySQL and Node.js, among others, and as such our protagonist considers himself quite proficient with both technologies.

MySQL 52

Recap of Hadoop News for January 2018

ProjectPro

News on Hadoop - January 2018. Apache Hadoop 3.0 goes GA, adds hooks for cloud and GPUs. TechTarget.com, January 3, 2018. Hadoop 3.0, the latest update to the 11-year-old big data framework, allows cluster pooling of GPU resources, reduces storage requirements, and adds a novel federation scheme that lets the YARN resource manager and the job scheduler expand the number of nodes that can run within a Hadoop cluster.

Hadoop 52

Breaking down data silos: when SAP alone is not enough

Cloudera

Running a large company is impossible without having an ERP system in place, and SAP business software remains at the forefront in this category. But when companies are looking towards new technologies such as data lakes, machine learning or predictive analytics, SAP alone is just not enough. To keep up with tech trends, businesses have to face the challenges of integrating SAP with non-SAP technologies and embark on a crusade against data silos.

More Trending

Building Reliable Reprocessing and Dead Letter Queues with Apache Kafka

Uber Engineering

In distributed systems, retries are inevitable. From network errors to replication issues and even outages in downstream dependencies, services operating at a massive scale must be prepared to encounter, identify, and handle failure as gracefully as possible. Given the scope … The post Building Reliable Reprocessing and Dead Letter Queues with Apache Kafka appeared first on Uber Engineering Blog.

Kafka 109

Data Analysis with Spark

Zalando Engineering

Apache’s lightning fast engine for data analysis and machine learning

In recent years, there has been a massive shift in the industry towards data-oriented decision making backed by enormously large data sets. This means that we can serve our customers with more relevant, personalized content. We in the Digital Experience team are tasked with analysing Big Data in order to gather insights and support the product team with the decision making process.

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

Data Engineering Podcast

Summary: As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers.

Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17

Data Engineering Podcast

Summary: One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both options, in addition to being globally scalable and fast.

Kafka 100

2017 – Another Award-Winning Year for Cloudera!

Cloudera

In many ways, 2017 was a singular year for Cloudera, not least because we staged a successful IPO and joined the ranks of the world’s fastest-growing, publicly traded companies. We deeply appreciate the vote of confidence and trust our customers have placed in us and are proud of the hard work of our 1,600-plus employees. These are some of the year’s highlights.

Cloudera on Cloudera: Our Journey to Becoming more Data-driven

Cloudera

I’ve spent the last four years here at Cloudera talking with our customers about how to run their businesses better using their data and Cloudera’s products and services. Now I get to put my money where my mouth is – and turn my focus internally on how we at Cloudera can become more data-driven. We aspire to and are on the journey to be the best-run company on data, and to be our own best reference.

Cybersecurity on Call: Nation-State Cyber Operations with Patrick Tucker

Cloudera

As cyber attacks continue to increase across the world, it has become more critical for countries to implement cyber operations from a defensive and offensive perspective to protect national secrets and their citizens. An Arizona State University research paper showed just how global this problem is when they discovered that if hackers discussed a zero-day exploit on the dark web in Chinese the likelihood of a hacker exploiting the vulnerability was 9%.

Dave Shuman Talks IoT and Big Data on Federal News Radio

Cloudera

What exactly can we expect for IoT in 2018, and how can you improve your organization with connected devices? That was the question Dave Shuman set out to answer when he sat down last month with John Gilroy at the Federal News Radio headquarters in Washington, D.C. Federal Tech Talk looks at the world of high technology in the federal government and, as its host, John speaks the language of federal CISOs, CIOs, and CTOs.

Innovation in Digital Experience

Zalando Engineering

Multi-functional teams make for a greater customer journey

When I started in Zalando Tech, I hadn’t worked with a product manager before, and I had probably never seen a UX designer, a UI designer, a researcher or a business developer before either. My world was data science, more specifically, personalization and recommender systems. In this isolated bubble, data scientists often thought we could solve all problems without help, but in the last two years, I came to understand why we need to sto

Five Minutes from Machine Learning to RESTful API

Zalando Engineering

The benefits of Connexion: Zalando’s open source API-First framework

In this article, I will show how quick and simple it can be to create a RESTful API for a machine learning model using Zalando’s open source Swagger/OpenAPI First framework called Connexion. Official documentation describes Connexion as the following: “Connexion is a framework on top of Flask that automagically handles HTTP requests based on OpenAPI 2.0 Specification (formerly known as Swagger Spec) of your API described in YAM

Cross-Lingual End-to-End Product Search with Deep Learning

Zalando Engineering

How We Built the Next Generation Product Search from Scratch using a Deep Neural Network

Product search is one of the key components in an online retail store. A good product search can understand a user’s query in any language, retrieve as many relevant products as possible, and finally present the results as a list in which the preferred products should be at the top, and the less relevant products should be at the bottom.

Crushing AVRO Small Files with Spark

Zalando Engineering

Solving the many small files problem for AVRO

The Fashion Content Platform teams in Zalando Dublin handle large amounts of data on a daily basis. To make sense of it all, we utilise Hadoop (EMR) on AWS. Within this post, we discuss a system where a real-time system feeds the data. Due to the variance in data volumes and the period that these systems write to storage, there can be a large number of small files.

Hadoop 40