Top Data Engineering Digest Structured Data Machine Learning Content for October, 2018

October, 2018

Uber’s Big Data Platform: 100+ Petabytes with Minute Latency

Uber Engineering

OCTOBER 17, 2018

Uber is committed to delivering safer and more reliable transportation across our global markets. To accomplish this, Uber relies heavily on making data-driven decisions at every level, from forecasting rider demand during high traffic events to identifying and addressing bottlenecks …

Big Data

Big Data Transportation Data Engineering

Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54

Data Engineering Podcast

OCTOBER 28, 2018

Summary: Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn’t necessary, and the production workflows can just run the notebooks directly.

Scala

Scala Python Data Engineering Data Engineer

Trending Sources

Cloudera + Hortonworks, from the Edge to AI

Cloudera

OCTOBER 3, 2018

We’ve just announced that Cloudera and Hortonworks have agreed to merge to form a single company. I want to explain the thinking behind the deal and the combination. Rob Bearden from Hortonworks has written up a post sharing his thoughts, as well. First, remember the history of Apache Hadoop. Google built an innovative scale-out platform for data storage and analysis in the late 1990s and early 2000s, and published research papers about their work.

Hadoop

Hadoop Cloud Data Storage Machine Learning

Netflix MediaDatabase - Media Timeline Data Model

Netflix Tech

OCTOBER 31, 2018

Netflix Media Database - the Media Timeline Data Model. In the previous post in this series, we described some important Netflix business needs as well as traits of the media data system, called “Netflix Media DataBase” (NMDB), that is used to address them. The curious reader might have noticed that a majority of these characteristics relate to properties of the data managed by NMDB.

Media

Media Metadata Data MongoDB

A Guide to Debugging Apache Airflow® DAGs

In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples to debugging Airflow DAGs. You’ll learn how to: Create a standardized process for debugging to quickly diagnose errors in your DAGs Identify common issues with DAGs, tasks, and connections Distinguish between Airflow-relate

Data Pipeline

#NoEstimates

Zalando Engineering

OCTOBER 31, 2018

Why I advocate a practice of no estimates as a software engineer. Before I get to the topic, I would like to clarify one thing: I don’t want to ban estimations generally from software development, as there are good and solid reasons for them. In a nutshell, business needs to be predictable. I want to show a software developer's view on how to reduce or even get rid of endless estimation meetings with doubtful outcomes.

Software Engineer

Software Engineer Software Engineering Coding Building

Cloud Native: What It Means in the Data World

Rockset

OCTOBER 30, 2018

Prior to Rockset, I spent eight years at Facebook building out their big data infrastructure and online data infrastructure. All the software we wrote was deployed in Facebook's private data centers, so it was not till I started building on the public cloud that I fully appreciated its true potential. Facebook may be the very definition of a web-scale company, but getting hardware still required huge lead times and extensive capacity planning.

Cloud

Cloud IT MongoDB Hadoop

Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads

Uber Engineering

OCTOBER 30, 2018

Cluster management, a common software infrastructure among technology companies, aggregates compute resources from a collection of physical hosts into a shared resource pool, amplifying compute power and allowing for the flexible use of data center hardware. At Uber, cluster management …

Engineering

Engineering Management Technology Hadoop

More Trending

Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53

Data Engineering Podcast

OCTOBER 21, 2018

Summary: As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product.

Data Science

Data Science Software Engineer Software Engineering Python

Doing a 180 on Customer 360 – The Preferred Path to Customer Insights

Cloudera

OCTOBER 30, 2018

451 Research analyst Sheryl Kingstone and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms (CIPs). In this blog post, Sheryl outlines how next-gen CIP applications are delivering a better customer experience, and why businesses are relying on CIPs as their preferred path to customer insights.

Unstructured Data

Unstructured Data Data Lake Algorithm Machine Learning

Recap of Hadoop News for September 2018

ProjectPro

OCTOBER 5, 2018

Hadoop-as-a-Service: The Need Of The Hour For Superior Business Solutions. InsideBigData.com, September 7, 2018. Hadoop is the cornerstone of the big data industry; however, the challenges involved in maintaining the Hadoop network have led to the development and growth of the Hadoop-as-a-Service (HaaS) market. Industry research reveals that the global Hadoop-as-a-Service market is anticipated to reach $16.2 billion by 2020, growing at a compound annual growth rate of 70.8% from 2014 to 2020. With market l

Hadoop

Hadoop BI Big Data MongoDB

Singleton Types

Zalando Engineering

OCTOBER 24, 2018

A Scala 3 Experiment. I'll start this post by admitting that I’ve never gone deeply into any kind of Scala coding on the type level. It's not what I, as a common application (or microservice) developer, usually need. Having stated that, of course, I might be missing out on a whole world of opportunities for better code without knowing. And because of that, I put some effort into trying to understand the features of Scala that might sound strange, overly-theoretical, and maybe even useless, at f

Scala

Scala Coding Database IT

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

Data

The Road Ahead: From Open Source to Open Services

Rockset

OCTOBER 19, 2018

I love open-source but open-source software for data infrastructure is on the way out. There, I said it. And you might think I've got a screw loose, given the broad adoption of open source today, but hear me out. Yes, open source is ubiquitous in data management today, but the era of open-source innovation is all but over. In the age of public cloud, there is no longer a reason to build or use open source for data infrastructure, and a new category of software I'm labeling open services will ren

MongoDB

MongoDB Hadoop Kafka Data Warehouse

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

Data Engineering Podcast

OCTOBER 14, 2018

Summary: With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake.

Data Lake

Data Lake Big Data Cloud Hadoop

And the 2018 EMEA Partner Summit Award Winners are…

Cloudera

OCTOBER 1, 2018

What an evening! Last week Cloudera hosted over 150 attendees from over 21 countries across the region at our annual EMEA Partner Summit in Amsterdam. Representatives from across the Cloudera ecosystem came together to hear from company executives and EMEA leadership and to attend interactive sessions on Machine Learning, AI and Data Analytics, Cloud and Platform, as well as training and certification opportunities.

Machine Learning

Machine Learning Consulting Cloud Certification

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

Systems

Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov - Episode 51

Data Engineering Podcast

OCTOBER 9, 2018

Summary: One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application-oriented workloads and analytical, high-volume workloads on the same hardware.

PostgreSQL

PostgreSQL BI Machine Learning Data Warehouse

Federated Learning: Machine Learning with Privacy on the Edge

Cloudera

OCTOBER 30, 2018

Federated Learning is a technology that allows you to build machine learning systems when your datacenter can’t get direct access to model training data. The data remains in its original location, which helps to ensure privacy and reduces communication costs. Privacy and reduced communication make federated learning a great fit for smartphones and edge hardware, healthcare and other privacy-sensitive use cases, and industrial applications such as predictive maintenance.

Machine Learning

Machine Learning Healthcare Manufacturing Accessible

Growing a Product Area at Zalando

Zalando Engineering

OCTOBER 17, 2018

The six-month journey of the customer inbox multi-disciplinary team. The customer inbox multi-disciplinary area operates in the Fashion Store pillar of the Zalando platform organization. The purpose of the Customer Inbox Unit is to serve customers personal and practical fashion messages, through multiple channels, i.e. “Target the customers at the right time, at the right place.”

Consulting

Consulting Management Engineering IT

A Team for Teams

Zalando Engineering

OCTOBER 9, 2018

How we revolutionized the way we worked agile. One and a half years ago we started something new at Zalando. We asked all producers of our department to join one team with the purpose of helping us create great teams to get things done in the best way possible. Where did we start from? The producer role had been introduced at Zalando to provide a team with whatever it lacked at a certain moment in time, be it a roadmap, team building, process improvement, documentation or even testing.

Generalist

Generalist Retail Recruitment BI

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Manufacturing

Four Pillars Of Leading People

Zalando Engineering

OCTOBER 3, 2018

Essential building blocks for strong leadership that enables people to grow and achieve results. The story of how I ended up working for Zalando in Berlin starts with a LinkedIn message from Joseph Wilkinson, one of our tech recruiters. In tech, we get a lot of messages on LinkedIn, but this one was different and made me very interested to know more about Zalando.

Recruitment

Recruitment Building Engineering Designing
