January, 2018

Functional Data Engineering — a modern paradigm for batch data processing

Maxime Beauchemin

Batch data processing, historically known as ETL, is extremely challenging. It’s time-consuming, brittle, and often unrewarding. Not only that, it’s hard to operate, evolve, and troubleshoot. In this post, we’ll explore how applying the functional programming paradigm to data engineering can bring a lot of clarity to the process. This post distills fragments of wisdom accumulated while working at Yahoo, Facebook, Airbnb and Lyft, with the perspective of well over a decade of data warehousing

Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16

Data Engineering Podcast

Summary: Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators, the Dat Project was created. In this episode, Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used.

Do These Things if you Want to Succeed as an HR Professional

U-Next

Success in today’s businesses has taken on several meanings. Beyond pay hikes and promotions, success has gained new dimensions of very recent origin. Today, success has become synonymous with happiness at the workplace, challenging tasks, compensatory rewards, incentives, authoritative job profiles, influential roles, and more. The current talent pools in organizations have become wiser and more mature than their previous-generation counterparts.

Data Engineering is Critical to Big Data Success

Cloudera

I mentioned in an earlier blog titled “Staffing your big data team” that data engineers are critical to a successful data journey. That said, most companies that are early in their journey lack a dedicated engineering group. And the longer it takes to put a team in place, the likelier it is that your big data project will stall. The data engineering team is responsible for collecting and ingesting batch and stream-oriented data, inventorying the data, working through ingest bottlenecks, and d

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

The Faces Behind the Fashion-MNIST

Zalando Engineering

We talk to Han and Kashif from Zalando Research. Employer Branding Specialist Data Science, Nana Yamazaki, catches up with the team using literal fashion icons in Deep Learning. Tell us about Fashion-MNIST. What did you want to accomplish? Fashion-MNIST is a freely available dataset of Zalando articles that, most importantly, has the same format as the MNIST dataset.

Postgres Internals: Building a Description Tool

Dataquest

In previous blog posts, we have described the Postgres database and ways to interact with it using Python. Those posts provided the basics, but if you want to work with databases in production systems, then it is necessary to know how to make your queries faster and more efficient. To understand what efficiency means in Postgres, it’s important to learn how Postgres works under the hood.

More Trending

CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14

Data Engineering Podcast

Summary: As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources, the requirement to distribute the computational resources for managing that information becomes more pronounced. In order to ensure that all of the distributed nodes in our systems agree with each other, we need to build mechanisms to properly handle replication of data and conflict resolution.

Citus Data: Distributed PostGreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13

Data Engineering Podcast

Summary: PostgreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data, they have built an extension to support running it in a distributed fashion across large volumes of data with parallelized queries for improved performance.

The Top 10 Most Popular VISION Blogs of 2017

Cloudera

The New Year is a great time to make resolutions, but it’s also a great time to reflect on the previous year. Before we get too far into 2018, let’s take a look at the ten most popular Cloudera VISION blogs from 2017. Today is an important day in the life of Cloudera. On April 28, 2017, Mike Olson, as one of the founders of Cloudera, writes about the initial public offering, and what the milestone means.

The three certainties in life: death, taxes and GDPR

Cloudera

As the GDPR clock ticks down to implementation, it is clear that this will not be a non-event like the Millennium Bug – it will happen and there will be dire consequences, potentially company-closures, in the event of non-compliance. The three certainties in life: death, taxes and GDPR. 1999 was a milestone year for the development of technology.

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Cybersecurity On Call: Goodbye 2017, Hello 2018! Top Five Tips from 2017

Cloudera

This was an amazing year for our inaugural “Cybersecurity On Call” season. It was truly an honor hosting amazing guests as we explored the world of cybersecurity. From industry thought leaders, to New York Times best sellers, to hackers, I learned a ton about the future of cybersecurity and I hope you did as well. Today’s episode won’t be our usual programming; it is our end-of-the-year special, where we will dive into our top five tips from this year’s season.

Breaking through the clouds in Asia Pacific

Cloudera

To quote Sam Walton, Walmart’s founder, “There is only one boss. The customer. And he can fire everybody in the company from the chairman on down, simply by spending his money somewhere else”. This very much forms the lens for our focus here at Cloudera Asia Pacific. And it is this unwavering passion and commitment that has driven the team to strive for the very best for our customers and partners, and milestones that we have collectively attained since 2015.

Six Strategies for Advancing Customer Knowledge: Bringing Data Together

Cloudera

I often meet with our customers to help them understand how to connect modern technology to business success. The ever-present question at these encounters is “Where do I start?” They may understand that they need a data-driven strategy, or the culture may be aiming to shift toward being guided by data. These are often goals set by the executive team with little guidance on how to execute or implement.

Staffing your big data team

Cloudera

Building the right team is as important as assembling the right IT infrastructure – and the needs differ just as dramatically. A traditional BI and analytics organization consists of three main groups: Analysts that develop reports often using sample data. The data management team – modelers that take requests, find data, and develop models to answer the questions.

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Rabbit in the Cloud

Zalando Engineering

How we deployed RabbitMQ on AWS. In an effort to move away from our legacy monolithic service, we decided to take on the challenge of building a new communication platform based on a microservice architecture, which would be more focused and more easily manageable. The challenge was exciting and big; we had to make crucial decisions early on, decisions that we would have to live with for the foreseeable future.

Building a Better Tech Radar

Zalando Engineering

How Zalando helps its engineering teams navigate the tech landscape. Zalando has more than 200 engineering teams, which regularly face tricky technology choices. To help them make good decisions, we created the Zalando Tech Radar as a "navigation" tool. Inspired by ThoughtWorks, it assigns each technology to one of four rings (Adopt, Trial, Assess and Hold), which represents the current consensus within Zalando.

Simplicity by Distributing Complexity

Zalando Engineering

Building an aggregated view of data in the event-driven microservice architecture. In the world of microservices, where a domain model gets decomposed into related, but independently handled entities, we often face the challenge of building an aggregate view of the data that brings together different parts of that model. While this can already be interesting with “traditional” designs, the move to event-driven architectures can magnify these difficulties, especially with simplistic event streams.

Why We Do Scala in Zalando

Zalando Engineering

Leveraging the full power of a functional programming language. In Zalando Dublin, you will find that most engineering teams are writing their applications using Scala. We will try to explain why that is the case and the reasons we love Scala. This content is coming both from my own experience and the team I'm working with in building the new Zalando Customer Data Platform.

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Rock Solid Kafka and ZooKeeper Ops on AWS

Zalando Engineering

Reducing ops effort while maintaining Kafka and ZooKeeper. This post is targeted at those looking for ways to reduce ops effort while maintaining Kafka and ZooKeeper deployments on AWS and also improving their availability and stability. In a nutshell, we are going to explain how using Elastic Network Interfaces can improve on a straight out-of-the-box setup.

Snorkel: Extracting Value From Dark Data with Alex Ratner - Episode 15

Data Engineering Podcast

Summary: The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, so the rest of the sources of knowledge in a company are housed in so-called “Dark Data” sets. In this episode, Alex Ratner explains how the work that he and his fellow researchers are doing on Snorkel can be used to extract value by leveraging labeling functions written by

Drawn Together

Zalando Engineering

How to talk about design in the agile world. How we improved design communication in the Retail Ops Team. With an agile and lean approach, most of us here at Zalando changed the way we build digital products. Design processes also evolved, with designers usually working alongside cross-functional product teams. But, at first, one thing did not change too much: how we talk about the design.