November, 2018

Open-Source Data Warehousing – Druid, Apache Airflow & Superset

Simon Späti

These days, everyone talks about open source. However, it is still not common in the Data Warehouse (DWH) field. Why is this? In my recent blog post, I researched OLAP technologies; for this post, I chose several open-source technologies and combined them to build a full data architecture for a Data Warehouse system. I went with Apache Druid for data storage, Apache Superset for querying, and Apache Airflow as the task orchestrator.

Observability at Scale: Building Uber’s Alerting Ecosystem

Uber Engineering

Uber’s software architecture consists of thousands of microservices that empower teams to iterate quickly and support our company’s global growth. These microservices support a variety of solutions, such as mobile applications, internal and infrastructure services, and products along with complex …

Set Up Your Own Data-as-a-Service Platform On Dremio with Tomer Shiran - Episode 58

Data Engineering Podcast

Summary: When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is by crafting pipelines that will extract the data from source systems and load it into a data lake or data warehouse. To make this situation more manageable and allow everyone in the business to gain value from the data, the folks at Dremio built a self-service data platform.

Netflix Information Security: Preventing Credential Compromise in AWS

Netflix Tech

By Will Bengtson. Previously, we wrote about a method for detecting credential compromise in your AWS environment. The methodology focused on a continuous learning model and first use principle. This solution is still reactive in nature: we only detect credential compromise after it has already happened. Even with detection capabilities, there is a risk that exposed credentials can provide access to sensitive data and/or the ability to cause damage in our environment.

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

Collaboration Between Data Science and Data Engineering: True or False?

Domino Data Lab: Data Engineering

This blog post includes candid insights about addressing tension points that arise when people collaborate on developing and deploying models. Domino’s Head of Content sat down with Don Miner and Marshall Presser to discuss the state of collaboration between data science and data engineering. The blog post provides distilled insights, audio clips, excerpted quotes as well as the full audio and written transcript.

Five strategies for skills-based volunteering: Lessons learned from Cloudera Cares first-ever Global Day of Service

Cloudera

Corporate volunteering is on the rise. However, only half of companies encourage their employees to participate in skills-based volunteering – defined as employees applying their abilities and specialized talents to challenges facing their communities. As the Program Manager for Cloudera Cares, Cloudera’s employee giving and volunteering program at the Cloudera Foundation, I believe that we can have more impact if we offer employees opportunities for skills-based volunteering.

More Trending

Tag-based Navigation of a Fashion Catalog

Zalando Engineering

Exploring the Zalando Assortment by Browsing a Product Similarity Graph. Introduction: As Europe's leading online fashion and lifestyle platform, Zalando is continually developing new features to enable our customers to find the products they want. While the standard tools of Search, Categorization & Attribute Filtering are par-for-the-course for purchasing items online, with an ever-expanding fashion assortment and an increase in the data available to describe a product, this browsing experie

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

Data Engineering Podcast

Summary: Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start us

Delivering Meaning with Previews on Web

Netflix Tech

By Corey Grunewald and Tony Casparro. As the Netflix catalog of films and series continues to grow, it becomes more challenging to present members with enough information to decide what to watch. How can a member tell if a movie is both a horror and a comedy? The synopsis and artwork help provide some context, but how can we leverage video previews (trailers) to help members find something great to watch?

Rockset's RocksDB-Cloud Library - Enabling the Next Generation of Cloud Native Databases

Rockset

Rockset and I began collaborating in 2016 due to my interest in their RocksDB-Cloud open-source key-value store. This post is primarily about the RocksDB-Cloud software, which Rockset open-sourced in 2016, rather than Rockset's newly launched cloud service. In it, I will explore how RocksDB-Cloud can be used to build an open-source cloud-friendly storage system.

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Cloudera Named a Fastest Growing Company by Deloitte for Fourth Year

Cloudera

For the fourth time in the past five years, Cloudera has been named to Deloitte’s Technology Fast 500 as one of the fastest growing companies in North America. This annual ranking showcases the growth of companies in the technology, media, telecommunications, life sciences, and energy tech sectors. This year’s list demonstrated the power of combining breakthrough research and development, entrepreneurship and rapid growth, with software companies like Cloudera making up nearly two-thirds of the

How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56

Data Engineering Podcast

Summary: A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.

Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55

Data Engineering Podcast

Summary: Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page.

Netflix at AWS re:Invent 2018

Netflix Tech

By Shaun Blackburn. AWS re:Invent is back in Las Vegas this week! Many Netflix engineers and leaders will be among the 40,000 attending the conference to connect with fellow cloud and OSS enthusiasts. You can find us at our booth on the expo floor, speaking on a variety of subjects, and at meetups and events around the re:Invent campus. We have listed all our talks below to make it easy to hear what we have been up to.

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

An introduction to Federated Learning

Cloudera

We’re excited to release Federated Learning, the latest report and prototype from Cloudera Fast Forward Labs. Federated learning makes it possible to build machine learning systems without direct access to training data. The data remains in its original location, which helps to ensure privacy and reduces communication costs. This article is about the business case for federated learning.

Zalando Postgres Operator: One Year Later

Zalando Engineering

The Postgres operator provides a managed Postgres service for Kubernetes. It extends the Kubernetes API with a custom “postgresql” resource that describes desired characteristics of a Postgres cluster, monitors updates of this resource and adjusts Postgres clusters accordingly. Zalando successfully uses the operator to manage more than 450 Postgres clusters across a large number of Kubernetes installations.

Zalando Research Releases “Flair”

Zalando Engineering

Open sourcing machine learning research for natural language processing (NLP). Two years ago, Zalando Research launched with a clear purpose to ensure that Zalando Tech is at the forefront of research in the areas of data science, machine learning, natural language processing and artificial intelligence. Our researchers’ work previously focused mainly within Zalando.

Digital Transformation Focused on Sustainability

Cloudera

My inspiration for writing this blog was a recent trip to a warehouse and distribution center of a well-known U.S. fast-food enterprise with a reputation for superior quality. During my visit, I had the opportunity to chat with the center’s Manager for Food Safety whose credentials (Ph.D. in Food Science), knowledge, and experience reflect the company’s commitment to product safety and quality.

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Train Deep Learning Models on AWS

Zalando Engineering

A real-life example of how to train a Deep Learning model on an AWS Spot Instance using Spotty. Spotty is a tool that simplifies training of Deep Learning models on AWS. Why will you ❤️ this tool? It makes training on AWS GPU instances as simple as training on your local computer; it automatically manages all necessary AWS resources, including AMIs, volumes and snapshots; it makes your model trainable on AWS by everyone with a couple of commands; it detaches remote processes from SSH sessions; it sav

Open Source: October Review - Hacktoberfest, new releases and more.

Zalando Engineering

Project Highlights: Connexion version 2.0 with OpenAPI 3 support is ready; check out what is new in our latest release! Connexion is the Swagger/OpenAPI-first framework for Python on top of Flask, with automatic endpoint validation and OAuth2 support. With 87 active contributors and more than 1,000 repositories worldwide that depend on it, Connexion is one of the most successful open source releases of Zalando.

Connexion 2.0 Release

Zalando Engineering

Today, we released Connexion 2.0 with OpenAPI 3 support. Connexion is a Python framework that automagically handles HTTP requests based on OpenAPI Specification (formerly known as Swagger Spec) of your API described in YAML format. Connexion allows you to write a Swagger specification, then maps the endpoints to your Python functions. Besides routing, Connexion also validates requests and responses automatically based on OpenAPI specifications, handles common authentication schemes, supports API

Dynamic Typing in SQL

Rockset

As Peter Bailis put it in his post, querying unstructured data using SQL is a painful process. Moreover, developers frequently prefer dynamic programming languages, so interacting with the strict type system of SQL is a barrier. We at Rockset have built the first schemaless SQL data platform. In this post and a few others that follow, we'd like to introduce you to our approach.

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Why SQL on Raw Data?

Rockset

Over a decade after the inception of the Hadoop project, the amount of unstructured data available to modern applications continues to increase. Moreover, despite forecasts to the contrary, SQL remains the lingua franca of data processing; today's NoSQL and Big Data infrastructure platform usage often involves some form of SQL-based querying. This longevity is a testament to the community of analysts and data practitioners who are familiar with SQL as well as the mature ecosystem of tools around

Making smart cities safer with data

Cloudera

By Mark Micallef, Vice President of Asia Pacific and Japan, Cloudera. What comes to your mind when you think of the term “smart city”? For me, it conjures an image of a city where everything is interconnected, enabling it to run efficiently and offer convenient, secure, and personalized services to its residents at the touch of their fingertips. While such a city might sound like a utopian dream, it could potentially turn into a dystopian nightmare if we overlook the risks brought about by the