2017

article thumbnail

The Downfall of the Data Engineer

Maxime Beauchemin

This post follows up on The Rise of the Data Engineer , a recent post that was an attempt at defining data engineering and described how this new role relates to historical and modern roles in the data space. In this post, I want to expose the challenges and risks that cripple data engineers and enumerates the forces that work against this discipline as it goes through its adolescence.

article thumbnail

Evolving Distributed Tracing at Uber Engineering

Uber Engineering

Distributed tracing is quickly becoming a must-have component in the tools that organizations use to monitor their complex, microservice-based architectures. At Uber Engineering, our open source distributed tracing system Jaeger saw large-scale internal adoption throughout 2016, integrated into hundreds … The post Evolving Distributed Tracing at Uber Engineering appeared first on Uber Engineering Blog.

article thumbnail

Wallaroo with Sean T. Allen - Episode 12

Data Engineering Podcast

Summary Data oriented applications that need to operate on large, fast-moving sterams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how Wallaroo was designed and built to reduce the cognitive overhead of building this style of project.

Kafka 100
article thumbnail

8 Key Facts You Should know if You are a HR Professional

U-Next

Two of the most common reasons why people think they can be great HR professionals are either they are very organized and systematic or they have good people skills. But these two qualities alone are not enough for anyone to make it big in their career in human resource management. The two attributes can land them jobs but to move up the ladder, they definitely need some qualities that will set them apart from other employees.

article thumbnail

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

article thumbnail

Constant Gardening

Zalando Engineering

How effective management is a continuing story of growth Producers’ Style One of the things I struggled the most with in the past year was identifying the best way to lead my teams. I worked a lot on myself, observed my peers, and tried to learn from my leads, but in the end, I ran into into the well known dilemma: task-focused or people-focused management, which one is best?

article thumbnail

Recap of Hadoop News for November 2017

ProjectPro

News on Hadoop - November 2017 IBM leads BigInsights for Hadoop out behind barn. Shots heard.theRegister.co.uk, November 8, 2017. IBM’s BigInsights for Hadoop sunset on December 6, 2017. IBM will not provide any further new instances for the basic plan of its data analytics platform. The existing instances will continue to be available on the Bluemix console as is from December 7, 2017 to November 7, 2018.

Hadoop 52

More Trending

article thumbnail

Deep Learning in Cloudera

Cloudera

Deep learning is in the news. It’s changing the game. It’s changing your life. It’s changing everything. It will change the world. It’s good to see people excited about technology. But deep learning is a tool that enterprises use to solve practical problems. Nothing more, and nothing less. In this blog, we provide a few examples that show how organizations put deep learning to work.

article thumbnail

Apache Airflow and the Future of Data Engineering: A Q&A

Maxime Beauchemin

With a brief Introduction and Takeaway added by Taylor D. Edmiston Introduction Every once in a while I read a post about the future of tech that resonates with clarity. A few weeks ago it was The Rise of the Data Engineer by Maxime Beauchemin, a data engineer at Airbnb and creator of their data pipeline framework, Apache Airflow. At Astronomer, Apache Airflow is at the very core of our tech stack : our integration workflows are defined by data pipelines built in Apache Airflow as directed acycl

article thumbnail

The Rise of the Data Engineer

Maxime Beauchemin

I joined Facebook in 2011 as a business intelligence engineer. By the time I left in 2013, I was a data engineer. I wasn’t promoted or assigned to this new role. Instead, Facebook came to realize that the work we were doing transcended classic business intelligence. The role we’d created for ourselves was a new discipline entirely. My team was at forefront of this transformation.

article thumbnail

SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11

Data Engineering Podcast

Summary Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the challenges that he faced in doing so, and how it works under the hood. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll

Database 100
article thumbnail

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

article thumbnail

Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10

Data Engineering Podcast

Summary To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode Ewen Cheslack-Postava explains what the schema registry is, how it can be used, and how they built it.

Kafka 100
article thumbnail

data.world with Bryon Jacob - Episode 9

Data Engineering Podcast

Summary We have tools and platforms for collaborating on software projects and linking them together, wouldn’t it be nice to have the same capabilities for data? The team at data.world are working on building a platform to host and share data sets for public and private use that can be linked together to build a semantic web of information. The CTO, Bryon Jacob, discusses how the company got started, their mission, and how they have built and evolved their technical infrastructure.

article thumbnail

Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8

Data Engineering Podcast

Summary With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.

Hadoop 100
article thumbnail

Buzzfeed Data Infrastructure with Walter Menendez - Episode 7

Data Engineering Podcast

Summary Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This lets them produce new content that will continue to be well-received. To surface the insights that they need to grow their business they need a robust data infrastructure to reliably capture all of those interactions.

article thumbnail

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

article thumbnail

Astronomer with Ry Walker - Episode 6

Data Engineering Podcast

Summary Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source.

article thumbnail

Rebuilding Yelp's Data Pipeline with Justin Cunningham - Episode 5

Data Engineering Podcast

Summary Yelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their monolithic architecture to be more modular and modern, and then they open sourced it! In this episode Justin Cunningham joins me to discuss the decisions they made and the lessons they learned in the process, including what worked, what didn’t, and what he would do differently

article thumbnail

ScyllaDB with Eyal Gutkind - Episode 4

Data Engineering Podcast

Summary If you like the features of Cassandra DB but wish it ran faster with fewer resources then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it differentiates itself in the crowded database market. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch

Database 100
article thumbnail

Defining Data Engineering with Maxime Beauchemin - Episode 3

Data Engineering Podcast

Summary What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more. Transcript provided by CastSource Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.

article thumbnail

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

article thumbnail

Dask with Matthew Rocklin - Episode 2

Data Engineering Podcast

Summary There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task oriented workflow tool and an in memory processing framework, and how it brings the power of Python to bear on the problem of big data. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure Go to dataengineeringpodcast.com to subscribe to the show, sign up for the news

Hadoop 100
article thumbnail

Pachyderm with Daniel Whitenack - Episode 1

Data Engineering Podcast

Summary Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. It also lets you use whatever languages you want to run your analysis with its container based task graph. This week Daniel Whitenack shares the story of how the project got started, how it works under the covers, and how you can get started using it today!

Data Lake 100
article thumbnail

Introducing The Show

Data Engineering Podcast

Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes , or Google Play Music , share it on social media, and tell your friends and co-workers.

Media 100
article thumbnail

Hudi: Uber Engineering’s Incremental Processing Framework on Apache Hadoop

Uber Engineering

With the evolution of storage formats like Apache Parquet and Apache ORC and query engines like Presto and Apache Impala , the Hadoop ecosystem has the potential to become a general-purpose, unified serving layer for workloads that can tolerate latencies … The post Hudi: Uber Engineering’s Incremental Processing Framework on Apache Hadoop appeared first on Uber Engineering Blog.

Hadoop 94
article thumbnail

Prepare Now: 2025s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

article thumbnail

Re-Architecting Cash and Digital Wallet Payments for India with Uber Engineering

Uber Engineering

Uber is developing a payment platform for India that enables operations teams to more seamlessly collect and distribute cash and digital wallet payments to drivers. In this article, San Francisco-based software engineer Yijun Liu reflects on his experiences working with … The post Re-Architecting Cash and Digital Wallet Payments for India with Uber Engineering appeared first on Uber Engineering Blog.

article thumbnail

The Road to uChat: Building Uber’s Internal Chat Solution

Uber Engineering

Two years ago, Uber’s previous chat application began showing signs that it would not be able to adapt to our growth. There were app crashes, performance hiccups, and outages that crippled our company’s ability to effectively communicate online. With user … The post The Road to uChat: Building Uber’s Internal Chat Solution appeared first on Uber Engineering Blog.

article thumbnail

Engineering Uber Predictions in Real Time with ELK

Uber Engineering

Uber’s services rely on the accuracy of our event prediction a n d f o r e c a s t i n g t o o l s. From estimating rider demand on a given date to predicting … The post Engineering Uber Predictions in Real Time with ELK appeared first on Uber Engineering Blog.

article thumbnail

Introducing AthenaX, Uber Engineering’s Open Source Streaming Analytics Platform

Uber Engineering

Uber facilitates seamless and more enjoyable user experiences by channeling data from a variety of real-time sources. These insights range from in-the-moment traffic conditions that provide guidance on trip routes to the Estimated Time of Delivery (ETD) of an UberEATS … The post Introducing AthenaX, Uber Engineering’s Open Source Streaming Analytics Platform appeared first on Uber Engineering Blog.

article thumbnail

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

article thumbnail

Engineering On-Demand Transportation for Business with Uber Central

Uber Engineering

When Uber launched in 2009, our mission was simple: make transportation as reliable as running water everywhere, for everyone. While our mission remains the same today, the number of Uber use cases have grown dramatically, motivating our engineers to think … The post Engineering On-Demand Transportation for Business with Uber Central appeared first on Uber Engineering Blog.

article thumbnail

Spaghetti and Marshmallows at Zalando: An Exercise to Inspire Deep Learning

Zalando Engineering

Some months ago I had the opportunity, with two fellow Zalandos, to organize the “Dortmund 5PM”; a gathering across all Dortmund teams, scheduled once a month on Fridays in our local event space. We want to foster further cross-team collaboration between individuals, making these meetings a memorable experience for all. We opted for running The Marshmallow Challenge ; a funny design exercise that encourages teams to experience simple yet profound lessons in collaboration, innovation, and creativ

article thumbnail

Recap of Hadoop News for June 2017

ProjectPro

News on Hadoop - June 2017 Hadoop Servers Expose Over 5 Petabytes of Data. BleepingComputer.com, June 2, 2017. According to John Matherly, the founder of Shodan, a search engine used for discovering IoT devices found that Hadoop installed improperly configured HDFS based servers exposed over 5 PB of information. He found approximately 4487 HDFS servers available without authentication through public IP addresses that in total exposed 5120 TB of data.The expert said that 47820 MongoDB servers exp

Hadoop 52
article thumbnail

Hadoop Cluster Overview: What it is and how to setup one?

ProjectPro

What is a Hadoop Cluster? In general, a computer cluster is a collection of various computers that work collectively as a single system. “A hadoop cluster is a collection of independent components connected through a dedicated network to work as a single centralized data processing resource. “ “A hadoop cluster can be referred to as a computational computer cluster for storing and analysing big data (structured, semi-structured and unstructured) in a distributed environment.

Hadoop 52
article thumbnail

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.