Top Data Engineering Digest Big Data Skills Big Data Content for 2018

2018

Functional Data Engineering — a modern paradigm for batch data processing

Maxime Beauchemin

JANUARY 7, 2018

Batch data processing — historically known as ETL — is extremely challenging. It’s time-consuming, brittle, and often unrewarding. Not only that, it’s hard to operate, evolve, and troubleshoot. In this post, we’ll explore how applying the functional programming paradigm to data engineering can bring a lot of clarity to the process. This post distills fragments of wisdom accumulated while working at Yahoo, Facebook, Airbnb and Lyft, with the perspective of well over a decade of data warehousing

Data Process

Data Process Data Engineering Data Engineer Process

Open-Source Data Warehousing – Druid, Apache Airflow & Superset

Simon Späti

NOVEMBER 28, 2018

These days, everyone talks about open-source. However, this is still not common in the Data Warehouse (DWH) field. Why is this? In my recent blog post, I researched OLAP technologies. For this post, I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system. I went with Apache Druid for data storage, Apache Superset for querying and Apache Airflow as a task orchestrator.

Data Warehouse

Data Warehouse Data Storage Data Architecture Architecture

Trending Sources

Octopai: Metadata Management for Better Business Intelligence with Amnon Drori - Episode 28

Data Engineering Podcast

APRIL 22, 2018

Summary: The information about how data is acquired and processed is often as important as the data itself. For this reason, metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These systems are frequently cumbersome and difficult to maintain, so Octopai was founded to alleviate that burden.

Business Intelligence

Business Intelligence Metadata Management Data Governance

Observability at Scale: Building Uber’s Alerting Ecosystem

Uber Engineering

NOVEMBER 20, 2018

Uber’s software architecture consists of thousands of microservices that empower teams to iterate quickly and support our company’s global growth. These microservices support a variety of solutions, such as mobile applications, internal and infrastructure services, and products along with complex …

Building

Building Architecture Engineering

Our learnings from adopting GraphQL

Netflix Tech

DECEMBER 10, 2018

A Marketing Tech Campaign, by Artem Shtatnov and Ravi Srinivas Ranganathan. In an earlier blog post, we provided a high-level overview of some of the applications in the Marketing Technology team that we build to enable scale and intelligence in driving our global advertising, which reaches users on sites like The New York Times, YouTube, and thousands of others.

Coding

Coding Aggregated Data Utilities Architecture

Cloudera + Hortonworks, from the Edge to AI

Cloudera

OCTOBER 3, 2018

We’ve just announced that Cloudera and Hortonworks have agreed to merge to form a single company. I want to explain the thinking behind the deal and the combination. Rob Bearden from Hortonworks has written up a post sharing his thoughts, as well. First, remember the history of Apache Hadoop. Google built an innovative scale-out platform for data storage and analysis in the late 1990s and early 2000s, and published research papers about their work.

Hadoop

Hadoop Cloud Data Storage Machine Learning

Do These Things if you Want to Succeed as an HR Professional

U-Next

JANUARY 9, 2018

Success in today’s businesses has taken on several meanings. Beyond pay hikes and promotions, success has gained new dimensions of very recent origin. Today, success has become synonymous with happiness at the workplace, challenging tasks, compensatory rewards, incentives, authoritative job profiles, influential roles, and more. The current talent pools in organizations have become wiser and more mature than their previous-generation counterparts.

Recruitment

Recruitment Technology Management IT

More Trending

Cloud Nine: All Your Analytics, Wherever You Want Them. Really!

Teradata

DECEMBER 17, 2018

Brian Wood explains how Teradata Vantage in the cloud has your back when it comes to analytic simplicity, control, effectiveness, and results.

Cloud

Cloud IT

Creating Multi-language NLP Pipelines with Apache Spark

Domino Data Lab: Data Engineering

DECEMBER 22, 2018

In this guest post, Holden Karau, Apache Spark Committer, provides insights on how to create multi-language pipelines with Apache Spark and avoid rewriting spaCy into Java. She has already written a complementary blog post on using spaCy to process text data for Domino. Karau is a Developer Advocate at Google as well as a co-author of High Performance Spark and Learning Spark.

Java

Java Coding Process Machine Learning

Live Dashboards on Streaming Data - A Tutorial Using Amazon Kinesis and Rockset

Rockset

DECEMBER 20, 2018

We live in a world where diverse systems—social networks, monitoring, stock exchanges, websites, IoT devices—all continuously generate volumes of data in the form of events, captured in systems like Apache Kafka and Amazon Kinesis. One can perform a wide variety of analyses, like aggregations, filtering, or sampling, on these event streams, either at the record level or over sliding time windows.

AWS

AWS Kafka Data Ingestion Data

One Audio Sequencer to Rule Them All

Pandora Engineering

DECEMBER 5, 2018

Photo credit: Carol Yepes. Last month, Pandora announced a public podcast beta in conjunction with the Podcast Genome Project. This rollout introduced many exciting features to our current mobile application offerings, including fully integrated and native podcast support. Ironically, one of the most interesting features and perhaps our biggest engineering win with this iteration is something that’s transparent to our end users: the inclusion of a new audio playback sequencer used exclusively for

Media

Media Algorithm Coding Data Science

Open Source: November Review - Maintainer training, new releases and more

Zalando Engineering

DECEMBER 5, 2018

Project Highlights: ExternalDNS version 0.5.9 is ready for testing. This project allows you to control DNS records dynamically via Kubernetes resources in a DNS provider-agnostic way. ExternalDNS also successfully made its way to the Kubernetes Incubator. Check out the list of changes in this new release. Zalando-Incubator welcomed two brand new open source projects: 1) Darty - a data dependency manager for data science projects.

PostgreSQL

PostgreSQL Java Machine Learning Deep Learning

OLAP, what’s coming next?

Simon Späti

NOVEMBER 23, 2018

Are you on the lookout for a replacement for the Microsoft Analysis Cubes? Are you looking for a big data OLAP system that scales ad libitum? Do you want to have your analytics updated in real time? In this blog post, I want to show you possible solutions that are ready for the future and fit into an existing data architecture. What is OLAP? OLAP is an acronym for Online Analytical Processing.

Big Data

Big Data Data Architecture Architecture Systems

Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63

Data Engineering Podcast

DECEMBER 31, 2018

Summary: As more companies and organizations work to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fulfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different from those of batch-oriented workloads. To address this shortcoming, the team at Dell EMC has created the open source Pravega project.

Lambda Architecture

Lambda Architecture Process Data Process Kafka

Maximizing Process Performance with Maze, Uber’s Funnel Visualization Platform

Uber Engineering

AUGUST 16, 2018

At Uber, we spend a considerable amount of resources making the driver sign-up experience as easy as possible. At Uber’s scale, even a one percent increase in the rate of sign-ups to first trips (the driver conversion rate) carries a …

Process

Process Engineering Architecture Data

Netflix OSS and Spring Boot — Coming Full Circle

Netflix Tech

DECEMBER 18, 2018

By Taylor Wicksell, Tom Cellucci, Howard Yuan, Asi Bross, Noel Yap, and David Liu. In 2007, Netflix started on a long road towards fully operating in the cloud. Much of Netflix’s backend and mid-tier applications are built using Java, and as part of this effort Netflix engineering built several cloud infrastructure libraries and systems…

Java

Java Cloud AWS Government

Bringing AIOps to Machine Learning & Analytics

Cloudera

AUGUST 31, 2018

Two years ago I founded Hyperpilot with the mission to enable autopilot for container infrastructure. We learned a lot about data center automation based on real-time application and diagnostic feedback using applied machine learning. Last month, I joined Cloudera along with former team members Xiaoyun Zhu and Che-Yuan Liang to bring our expertise in intelligent automation to Cloudera’s modern platform for machine learning and analytics.

Machine Learning

Machine Learning Utilities Cloud Architecture

Announcing my session at #SQLBits - Azure Databricks

Advancing Analytics: Data Engineering

DECEMBER 3, 2018

Simon Whiteley and I will be back at #SQLBits 2019 talking about #DataEngineering and #DataScience in Databricks. We will look at #ApacheSpark #Python #Engineering & #MachineLearning in this full-day training session. Register now. Have you looked at Azure Databricks yet? No? Then you need to. Why, you ask? There are many reasons. Number one: knowing how to use Apache Spark will earn you more money.

Data Science

Data Science Machine Learning Python Data Pipeline

Making slow queries fast with composite indexes in MySQL

nodeSWAT

AUGUST 21, 2018

This post expects some basic knowledge of SQL. Examples were made using MySQL 5.7.18 and run on my mid-2014 MacBook Pro. Query execution times are based on multiple executions so index caching can kick in. The use case came from a real application and the solution is used in production. So you have inserted preliminary data into your database and run a simple COUNT(*) query against it with a simple WHERE clause and… the spinner is still run

MySQL

MySQL Datasets Database SQL

Data Science vs Engineering: Tension Points

Domino Data Lab: Data Engineering

DECEMBER 15, 2018

This blog post provides highlights and a full written transcript from the panel, “Data Science Versus Engineering: Does It Really Have To Be This Way?” with Amy Heineike, Paco Nathan, and Pete Warden at Domino HQ. Topics discussed include the current state of collaboration around building and deploying models, tension points that potentially arise, as well as practical advice on how to address these tension points.

Data Science

Data Science Engineering Data Building

Recap of Hadoop News for July 2018

ProjectPro

AUGUST 1, 2018

News on Hadoop - July 2018. Hadoop data governance services surface in wake of GDPR. TechTarget.com, July 2, 2018. GDPR has turned out to be a strong motivator that would bring greater governance to big data. At the recent DataWorks Summit 2018, though most of the attention was focussed on how Hadoop pioneer Hortonworks is all set to expand its service in the cloud, there was great interest and importance put on managing data privacy as well.

Hadoop

Hadoop Pharmaceutical Healthcare Data Lake

Programming Best Practices For Data Science

Dataquest

JUNE 8, 2018

The data science life cycle is generally comprised of the following components: data retrieval, data cleaning, data exploration and visualization, and statistical or predictive modeling. While these components are helpful for understanding the different phases, they don’t help us think about our programming workflow. Often, the entire data science life cycle ends up as an arbitrary mess of notebook cells in either a Jupyter Notebook or a single messy script.

Data Science

Data Science Programming Data Data Pipeline

#NoEstimates

Zalando Engineering

OCTOBER 31, 2018

Why I advocate a practice of no estimates as a software engineer. Before I get to the topic, I would like to clarify one thing: I don’t want to ban estimations generally from software development, as there are good and solid reasons for them. In a nutshell, business needs to be predictable. I want to show a software developer's view on how to reduce or even get rid of endless estimation meetings with doubtful outcomes.

Software Engineer

Software Engineer Software Engineering Coding Building

AI at the Forefront of Digital Transformation Process in 2018

InData Labs

MAY 7, 2018

Digital Transformation Definition: Digital transformation has been a big topic for a few years now, and it has many definitions. From a business perspective, digital transformation is about leveraging digital technologies to improve processes, competencies, and business models. It is also about changing the culture of the company because it requires letting go of old.

Process

Process Technology IT Data Science

Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62

Data Engineering Podcast

DECEMBER 23, 2018

Summary: Processing high-velocity time-series data in real time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode, Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.

PostgreSQL

PostgreSQL Kafka Data Engineering Data Engineer

Databook: Turning Big Data into Knowledge with Metadata at Uber

Uber Engineering

AUGUST 3, 2018

From driver and rider locations and destinations, to restaurant orders and payment transactions, every interaction on Uber’s transportation platform is driven by data. Data powers Uber’s global marketplace, enabling more reliable and seamless user experiences across our products for riders, …

Metadata

Metadata Big Data Transportation Data

Netflix Information Security: Preventing Credential Compromise in AWS

Netflix Tech

NOVEMBER 28, 2018

By Will Bengtson. Previously, we wrote about a method for detecting credential compromise in your AWS environment. The methodology focused on a continuous learning model and first use principle. This solution is still reactive in nature: we only detect credential compromise after it has already happened. Even with detection capabilities, there is a risk that exposed credentials can provide access to sensitive data and/or the ability to cause damage in our environment.

AWS

AWS Metadata Amazon Web Services Cloud

Meet the newest Data Superheros: The Sixth Annual Data Impact Awards Finalists Are…

Cloudera

AUGUST 28, 2018

Drum roll… Starting from well over 100 nominations, we are excited to announce the finalists for this year’s Data Impact Awards! Each year, nominees have raised the bar, and this year is no exception. The level of impact that organizations have shown and the variety of use cases are inspiring. From AI models that power retail customer decision engines to utility meter analysis that disables underperforming gas turbines, these finalists demonstrate how machine learning and analytics have become

Pharmaceutical

Pharmaceutical Telecommunication Consulting Food

New on Cloud Academy: Machine Learning on Google Cloud and AWS, Big Data Analytics, Terraform, and more

Cloud Academy

MAY 2, 2018

A 2017 IDC White Paper “recommend[s] that organizations that want to get the most out of cloud should train a wide range of stakeholders on cloud fundamentals and provide deep training to key technical teams” (emphasis ours). Regular readers of the Cloud Academy blog know we’ve been talking about this for a long time. Future-proofing your organization requires technical excellence, collective experience, business context, and shared understanding.

Google Cloud

Google Cloud Machine Learning Big Data AWS

Concurrency, MySQL and Node.js: A journey of discovery

nodeSWAT

FEBRUARY 5, 2018

Our story begins, like so many others, with a code-loving protagonist — someone we all can relate to. His days are largely filled with designing code, writing code and reading about code — keeping clients happy while learning and having fun. This has been going on for years now with both MySQL and Node.js, among others, and as such our protagonist considers himself quite proficient with both of those technologies.

MySQL

MySQL Database Programming Coding

Collaboration Between Data Science and Data Engineering: True or False?

Domino Data Lab: Data Engineering

NOVEMBER 18, 2018

This blog post includes candid insights about addressing tension points that arise when people collaborate on developing and deploying models. Domino’s Head of Content sat down with Don Miner and Marshall Presser to discuss the state of collaboration between data science and data engineering. The blog post provides distilled insights, audio clips, excerpted quotes as well as the full audio and written transcript.

Data Science

Data Science Data Engineering Data Engineer Engineering
