Top Data Engineering Digest Scala Data Warehouse Content for 2017

2017

The Downfall of the Data Engineer

Maxime Beauchemin

AUGUST 28, 2017

This post follows up on The Rise of the Data Engineer, a recent post that was an attempt at defining data engineering and described how this new role relates to historical and modern roles in the data space. In this post, I want to expose the challenges and risks that cripple data engineers and enumerate the forces that work against this discipline as it goes through its adolescence.

Data Engineering

Data Engineering Data Engineer Engineering Software Engineer

Evolving Distributed Tracing at Uber Engineering

Uber Engineering

FEBRUARY 2, 2017

Distributed tracing is quickly becoming a must-have component in the tools that organizations use to monitor their complex, microservice-based architectures. At Uber Engineering, our open source distributed tracing system Jaeger saw large-scale internal adoption throughout 2016, integrated into hundreds … The post Evolving Distributed Tracing at Uber Engineering appeared first on Uber Engineering Blog.

Engineering

Engineering Architecture Systems Scala

Webinars

Agent Tooling: Connecting AI to Your Tools, Systems & Data

How to Modernize Manufacturing Without Losing Control

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Trending Sources

Wallaroo with Sean T. Allen - Episode 12

Data Engineering Podcast

DECEMBER 24, 2017

Summary Data oriented applications that need to operate on large, fast-moving streams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how Wallaroo was designed and built to reduce the cognitive overhead of building this style of project.

Kafka

Kafka Python Data Engineering Data Engineer

8 Key Facts You Should know if You are a HR Professional

U-Next

DECEMBER 22, 2017

Two of the most common reasons why people think they can be great HR professionals are either they are very organized and systematic or they have good people skills. But these two qualities alone are not enough for anyone to make it big in their career in human resource management. The two attributes can land them jobs but to move up the ladder, they definitely need some qualities that will set them apart from other employees.

Technology

Technology Process IT Management

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

Data

Constant Gardening

Zalando Engineering

DECEMBER 13, 2017

How effective management is a continuing story of growth Producers’ Style One of the things I struggled the most with in the past year was identifying the best way to lead my teams. I worked a lot on myself, observed my peers, and tried to learn from my leads, but in the end, I ran into the well-known dilemma: task-focused or people-focused management, which one is best?

Management

Management Building Process Project

Recap of Hadoop News for November 2017

ProjectPro

DECEMBER 1, 2017

News on Hadoop - November 2017 IBM leads BigInsights for Hadoop out behind barn. Shots heard. theRegister.co.uk, November 8, 2017. IBM’s BigInsights for Hadoop sunset on December 6, 2017. IBM will not provide any further new instances for the basic plan of its data analytics platform. The existing instances will continue to be available on the Bluemix console as is from December 7, 2017 to November 7, 2018.

Hadoop

Hadoop Medical Unstructured Data Big Data

What is a Data Engineer?

Dataquest

JANUARY 25, 2017

From helping cars drive themselves to helping Facebook tag you in photos, data science has attracted a lot of buzz recently. Data scientists have become extremely sought after, and for good reason — a skilled data scientist can add incredible value to a business. But what about data engineers? Who are they, and what do they do? A data scientist is only as good as the data they have access to.

Data Engineering

Data Engineering Data Engineer Pipeline-centric Database-centric

More Trending

Deep Learning in Cloudera

Cloudera

OCTOBER 17, 2017

Deep learning is in the news. It’s changing the game. It’s changing your life. It’s changing everything. It will change the world. It’s good to see people excited about technology. But deep learning is a tool that enterprises use to solve practical problems. Nothing more, and nothing less. In this blog, we provide a few examples that show how organizations put deep learning to work.

Deep Learning

Deep Learning Scala Medical Data Science

Apache Airflow and the Future of Data Engineering: A Q&A

Maxime Beauchemin

FEBRUARY 28, 2017

With a brief Introduction and Takeaway added by Taylor D. Edmiston Introduction Every once in a while I read a post about the future of tech that resonates with clarity. A few weeks ago it was The Rise of the Data Engineer by Maxime Beauchemin, a data engineer at Airbnb and creator of their data pipeline framework, Apache Airflow. At Astronomer, Apache Airflow is at the very core of our tech stack: our integration workflows are defined by data pipelines built in Apache Airflow as directed acycl

Data Engineering

Data Engineering Data Engineer Engineering Data Pipeline

The Rise of the Data Engineer

Maxime Beauchemin

JANUARY 20, 2017

I joined Facebook in 2011 as a business intelligence engineer. By the time I left in 2013, I was a data engineer. I wasn’t promoted or assigned to this new role. Instead, Facebook came to realize that the work we were doing transcended classic business intelligence. The role we’d created for ourselves was a new discipline entirely. My team was at the forefront of this transformation.

Data Engineering

Data Engineering Data Engineer Engineering ETL Tools

Hudi: Uber Engineering’s Incremental Processing Framework on Apache Hadoop

Uber Engineering

MARCH 12, 2017

With the evolution of storage formats like Apache Parquet and Apache ORC and query engines like Presto and Apache Impala, the Hadoop ecosystem has the potential to become a general-purpose, unified serving layer for workloads that can tolerate latencies … The post Hudi: Uber Engineering’s Incremental Processing Framework on Apache Hadoop appeared first on Uber Engineering Blog.

Hadoop

Hadoop Process Engineering Data Architecture

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

Systems

SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11

Data Engineering Podcast

DECEMBER 17, 2017

Summary Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the challenges that he faced in doing so, and how it works under the hood. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll

Database

Database Data Pipeline Data Engineering Data Engineer

Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10

Data Engineering Podcast

DECEMBER 10, 2017

Summary To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode Ewen Cheslack-Postava explains what the schema registry is, how it can be used, and how they built it.

Kafka

Kafka Data Pipeline Data Engineering Data Engineer

data.world with Bryon Jacob - Episode 9

Data Engineering Podcast

DECEMBER 2, 2017

Summary We have tools and platforms for collaborating on software projects and linking them together; wouldn’t it be nice to have the same capabilities for data? The team at data.world are working on building a platform to host and share data sets for public and private use that can be linked together to build a semantic web of information. The CTO, Bryon Jacob, discusses how the company got started, their mission, and how they have built and evolved their technical infrastructure.

Data Pipeline

Data Pipeline Data Engineer Data Engineering Architecture

Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8

Data Engineering Podcast

NOVEMBER 22, 2017

Summary With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.

Hadoop

Hadoop Data Storage Data Pipeline Data Engineering

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Manufacturing

Astronomer with Ry Walker - Episode 6

Data Engineering Podcast

AUGUST 6, 2017

Summary Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source.

PostgreSQL

PostgreSQL MongoDB Data Pipeline Kafka

ScyllaDB with Eyal Gutkind - Episode 4

Data Engineering Podcast

MARCH 18, 2017

Summary If you like the features of Cassandra DB but wish it ran faster with fewer resources then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it differentiates itself in the crowded database market. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch

Database

Database Data Engineering Data Engineer Architecture

Defining Data Engineering with Maxime Beauchemin - Episode 3

Data Engineering Podcast

MARCH 4, 2017

Summary What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more. Transcript provided by CastSource. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.

Data Engineering

Data Engineering Data Engineer Engineering Software Engineer

Dask with Matthew Rocklin - Episode 2

Data Engineering Podcast

JANUARY 22, 2017

Summary There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task oriented workflow tool and an in memory processing framework, and how it brings the power of Python to bear on the problem of big data. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the news

Hadoop

Hadoop Python Data Analytics Data Engineering

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: Understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to; Write DAGs that adapt to your data at runtime and set up alerts and notifications; Scale you

Data Engineer

Pachyderm with Daniel Whitenack - Episode 1

Data Engineering Podcast

JANUARY 14, 2017

Summary Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. It also lets you use whatever languages you want to run your analysis with its container based task graph. This week Daniel Whitenack shares the story of how the project got started, how it works under the covers, and how you can get started using it today!

Data Lake

Data Lake Raw Data Kafka Data Engineering

Introducing The Show

Data Engineering Podcast

JANUARY 7, 2017

Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, share it on social media, and tell your friends and co-workers.

Media

Media Software Engineer Software Engineering Data Engineering

Re-Architecting Cash and Digital Wallet Payments for India with Uber Engineering

Uber Engineering

JUNE 19, 2017

Uber is developing a payment platform for India that enables operations teams to more seamlessly collect and distribute cash and digital wallet payments to drivers. In this article, San Francisco-based software engineer Yijun Liu reflects on his experiences working with … The post Re-Architecting Cash and Digital Wallet Payments for India with Uber Engineering appeared first on Uber Engineering Blog.

Engineering

Engineering Software Engineer Software Engineering Business Intelligence

The Road to uChat: Building Uber’s Internal Chat Solution

Uber Engineering

JULY 25, 2017

Two years ago, Uber’s previous chat application began showing signs that it would not be able to adapt to our growth. There were app crashes, performance hiccups, and outages that crippled our company’s ability to effectively communicate online. With user … The post The Road to uChat: Building Uber’s Internal Chat Solution appeared first on Uber Engineering Blog.

Building

Building Engineering Coding IT

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Cloud

Engineering Uber Predictions in Real Time with ELK

Uber Engineering

JULY 24, 2017

Uber’s services rely on the accuracy of our event prediction and forecasting tools. From estimating rider demand on a given date to predicting … The post Engineering Uber Predictions in Real Time with ELK appeared first on Uber Engineering Blog.

Engineering

Engineering Machine Learning Architecture Kafka

Introducing AthenaX, Uber Engineering’s Open Source Streaming Analytics Platform

Uber Engineering

OCTOBER 9, 2017

Uber facilitates seamless and more enjoyable user experiences by channeling data from a variety of real-time sources. These insights range from in-the-moment traffic conditions that provide guidance on trip routes to the Estimated Time of Delivery (ETD) of an UberEATS … The post Introducing AthenaX, Uber Engineering’s Open Source Streaming Analytics Platform appeared first on Uber Engineering Blog.

Engineering

Engineering Data Architecture SQL

Engineering On-Demand Transportation for Business with Uber Central

Uber Engineering

JUNE 29, 2017

When Uber launched in 2009, our mission was simple: make transportation as reliable as running water everywhere, for everyone. While our mission remains the same today, the number of Uber use cases has grown dramatically, motivating our engineers to think … The post Engineering On-Demand Transportation for Business with Uber Central appeared first on Uber Engineering Blog.

Transportation

Transportation Engineering Architecture

Spaghetti and Marshmallows at Zalando: An Exercise to Inspire Deep Learning

Zalando Engineering

AUGUST 23, 2017

Some months ago I had the opportunity, with two fellow Zalandos, to organize the “Dortmund 5PM”, a gathering across all Dortmund teams, scheduled once a month on Fridays in our local event space. We want to foster further cross-team collaboration between individuals, making these meetings a memorable experience for all. We opted for running The Marshmallow Challenge, a funny design exercise that encourages teams to experience simple yet profound lessons in collaboration, innovation, and creativ

Deep Learning

Deep Learning Education Designing Project

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Business Intelligence

Recap of Hadoop News for June 2017

ProjectPro

JULY 3, 2017

News on Hadoop - June 2017 Hadoop Servers Expose Over 5 Petabytes of Data. BleepingComputer.com, June 2, 2017. John Matherly, the founder of Shodan, a search engine used for discovering IoT devices, found that improperly configured HDFS-based Hadoop servers exposed over 5 PB of information. He found approximately 4487 HDFS servers available without authentication through public IP addresses that in total exposed 5120 TB of data. The expert said that 47820 MongoDB servers exp

Hadoop

Hadoop Food MongoDB Retail

Hadoop Cluster Overview: What it is and how to setup one?

ProjectPro

JUNE 22, 2017

What is a Hadoop Cluster? In general, a computer cluster is a collection of various computers that work collectively as a single system. “A hadoop cluster is a collection of independent components connected through a dedicated network to work as a single centralized data processing resource.” “A hadoop cluster can be referred to as a computational computer cluster for storing and analysing big data (structured, semi-structured and unstructured) in a distributed environment.

Hadoop

Hadoop IT Data Analysis Big Data

Getting to Know Hadoop 3.0 - Features and Enhancements

ProjectPro

JUNE 14, 2017

Hadoop was first made publicly available as open source in 2011; since then it has undergone major changes across three different versions. Apache Hadoop 3 is around the corner, with members of the Hadoop community at the Apache Software Foundation still testing it. The major release of Hadoop 3.x is anticipated to roll out sometime in mid-2017. What else can be more exciting for the big data community than waiting for the release of a major new version of the tiny toy elephant?

Hadoop

Hadoop Java Big Data Coding

Signalling Your Jenkins Build Status with a Mini USB Traffic Light

Zalando Engineering

JUNE 7, 2017

As part of an effort to increase developer awareness of quality, we wanted to draw attention to the fact that you should have healthy CI builds. The normal procedure revolved around emails sent to the individuals who broke the build with their last commit. With almost all of us used to receiving a lot of email-noise throughout the day, this is not a channel where you can expect an immediate reaction.

Building

Building Project Systems IT

Prepare Now: 2025s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Data
