February 2019


Spring for Apache Kafka Deep Dive – Part 1: Error Handling, Message Conversion and Transaction Support

Confluent

Following on from How to Work with Apache Kafka in Your Spring Boot Application, which shows how to get started with Spring Boot and Apache Kafka®, here we’ll dig a little deeper into some of the additional features that the Spring for Apache Kafka project provides. Spring for Apache Kafka brings the familiar Spring programming model to Kafka. It provides the KafkaTemplate for publishing records and a listener container for asynchronous execution of POJO listeners.
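To make those two building blocks concrete, here is a minimal sketch of a Spring Boot application that publishes records with a KafkaTemplate and consumes them with a @KafkaListener-backed POJO listener. The topic name, group id, and class names are placeholders invented for this example, not taken from the article; broker settings are assumed to come from the usual spring.kafka.* application properties.

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@SpringBootApplication
public class DemoApplication {
    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }
}

@Component
class GreetingProducer {
    private final KafkaTemplate<String, String> template;

    GreetingProducer(KafkaTemplate<String, String> template) {
        this.template = template;
    }

    // Publish a record; KafkaTemplate wraps the underlying Kafka producer.
    void send(String message) {
        template.send("greetings", message);   // "greetings" is a placeholder topic
    }
}

@Component
class GreetingListener {
    // The listener container invokes this POJO method asynchronously for each record.
    @KafkaListener(topics = "greetings", groupId = "greetings-group")
    public void listen(String message) {
        System.out.println("Received: " + message);
    }
}
```

With spring-kafka on the classpath, Spring Boot auto-configures both the KafkaTemplate and the listener container factory, so the listener method runs on container-managed threads rather than the caller’s thread.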


Protecting a Story’s Future with History and Science

Netflix Tech

By Kylee Peña, Chris Clark, and Mike Whipple. Kylee’s parents after their wedding in 1978. I, Kylee, have two photos from my parents’ wedding. Just two. This year they celebrated 40 years of marriage, so both photos were shot on film. Both capture a joy and awkwardness that come with young weddings. They’re fresh and full of life, candid captures from another era.


Deep Learning For Data Engineers

Data Engineering Podcast

Summary Deep learning is the latest class of technology that is gaining widespread interest. As data engineers, we are responsible for building and managing the platforms that power these models. To help us understand what is involved, we are joined this week by Thomas Henson. In this episode he shares his experiences experimenting with deep learning, what data engineers need to know about the infrastructure and data requirements to power the models that your team is building, and how it can be used…


Managing Uber’s Data Workflows at Scale

Uber Engineering

At Uber’s scale, thousands of microservices serve millions of rides and deliveries a day, generating more than a hundred petabytes of raw data. Internally, engineering and data teams across the company leverage this data to improve the Uber experience.


Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: an overview of ETL vs. ELT…


Cash Is Still King – Make Sure Your Business Is Prepared for the Next Recession

Teradata

If your organization understands customer profitability in detail, it can navigate a recession far more easily.


Introducing Cloudera DataFlow (CDF)

Cloudera

Late last year, the news of the merger between Hortonworks and Cloudera shook the industry and gave birth to the new Cloudera – the combined company with a focus on being an Enterprise Data Cloud leader and a product offering that spans from edge to AI. One of the most promising technology areas in this merger, one that already had high growth potential and is poised for even more, is the Data-in-Motion platform called Hortonworks DataFlow (HDF).


Building a Cross-platform In-app Messaging Orchestration Service

Netflix Tech

George Abraham, Devika Chawla, Chris Beaumont, and Daniel Huang. Thoughtful, relevant, and timely messaging is an integral part of a customer’s Netflix experience. The Netflix Messaging Engineering team builds the platform and the messages to communicate with Netflix customers. In-app messages at Netflix fall broadly into two channels…


Speed Up Your Analytics With The Alluxio Distributed Storage System

Data Engineering Podcast

Summary Distributed storage systems are the foundational layer of any big data stack. There are a variety of implementations which support different specialized use cases and come with associated tradeoffs. Alluxio is a distributed virtual filesystem which integrates with multiple persistent storage systems to provide a scalable, in-memory storage layer for scaling computational workloads independent of the size of your data.


How to Run SQL on PDF Files

Rockset

PDFs are the de facto standard for distributing and sharing fixed-layout documents today. A quick survey of my laptop folders reveals account statements, receipts, technical papers, book chapters, and presentation slides—all PDFs. Lots of valuable information finds its way into all manner of PDF files. That is a great reason for Rockset to support SQL queries on PDF files, in keeping with our mission to make data more usable to everyone.


The First Mistake of a CDO: Proposing Business Value

Teradata

Kevin Lewis explains the role of the chief data officer.


Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!


How ATB Financial is Utilizing Hybrid Cloud to Reduce the Time to Value for Big Data Analytics by 90 Percent

Cloudera

ATB Financial is Alberta’s largest home-grown financial institution and prides itself on its customer obsession, putting the more than 750,000 Albertans it serves at the centre of all that it does. As a result, ATB is constantly transforming in order to ensure it can continue to deliver unparalleled value to Albertans. A key pillar in the transformation journey is focused on robust data operations that can help ATB deliver timely, relevant and delightful service.


Machine Learning with Python, Jupyter, KSQL and TensorFlow

Confluent

Building a scalable, reliable and performant machine learning (ML) infrastructure is not easy. It takes much more effort than just building an analytic model with Python and your favorite machine learning framework. After all, machine learning with Python requires the use of algorithms that allow computer programs to constantly learn, but building that infrastructure is several levels higher in complexity.


Extending Vector with eBPF to inspect host and container performance

Netflix Tech

By Jason Koch, with Martin Spier, Brendan Gregg, and Ed Hunter. Improving the tools available to our engineers to help them diagnose, triage, and work through software performance challenges in the cloud is a key goal for the cloud performance engineering team at Netflix. Today we are excited to announce latency heatmaps and improved container support for our on-host monitoring solution…


Machine Learning In The Enterprise

Data Engineering Podcast

Summary Machine learning is a class of technologies that promise to revolutionize business. Unfortunately, it can be difficult to identify and execute on ways that it can be used in large companies. Kevin Dewalt founded Prolego to help Fortune 500 companies build, launch, and maintain their first machine learning projects so that they can remain competitive in our landscape of constant change.


Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.


What Is Readable Code?

Pandora Engineering

Code creates interfaces. But code itself is also an interface.


What Lessons Can Apollo 13 Teach Us About Analytics?

Teradata

Tom Casey explains lessons from the Apollo 13 program and how they can be applied to day-to-day dealings in the analytics world.


Cloudera announces support for Azure’s next-generation Data Lake Store

Cloudera

Today we are proud to announce our support for ADLS Gen2 as it enters general availability on Microsoft Azure. CDH 6.1 already includes support for MapReduce and Spark jobs, Hive and Impala queries, and Oozie workflows on ADLS Gen2. The Cloudera platform delivers a one-stop shop that allows you to store any kind of data, process and analyze it in many different ways in a single environment, and integrate with the rest of your data infrastructure.


Journey to Event Driven – Part 2: Programming Models for the Event-Driven Architecture

Confluent

Part 1 of this series discussed why you need to embrace event-first thinking, while this article builds a rationale for different styles of event-driven architectures and compares and contrasts their scaling, persistence and runtime models. Once we have settled on the event streaming approach, I’ll provide a high-level dataflow showing how we design systems for payment processing at scale using this approach.


15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?


Engineering to Improve Marketing Effectiveness (Part 3): Scaling Paid Media campaigns

Netflix Tech

This is the third blog of the series on Marketing Technology at Netflix. This blog focuses on the marketing tech systems that are responsible for campaign setup and delivery of our paid media campaigns. The first blog focused on solving for creative development and localization at scale.


Cleaning And Curating Open Data For Archaeology

Data Engineering Podcast

Summary Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face with scaling ETL processes which require domain specific knowledge, and how the information contained in connections that they expose is being used for interesting projects.


How to Build a Facebook Messenger Chatbot Powered by Fast SQL on CSV

Rockset

A chatbot, like any human customer service rep, needs data about your business and products in order to respond to customers with the correct information. What is an efficient way to hook up your data to a chat application without significant data engineering? In this blog, I will demonstrate how you can build a Facebook Messenger chatbot to help users find vacation rentals using CSV data on Airbnb rentals.


Is There Such a Thing as Too Much Parallelism?

Teradata

In her blog, Carrie Ballinger discusses parallelism and how you can fashion it to specific needs by using the new sparse map capability.


Prepare Now: 2025’s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.


Governing for digital transformation and growth

Cloudera

Ask a CIO where their focus lies and ‘digital transformation’ as well as ‘growth’ will come into the conversation quite quickly. The former sees growing investment in data analytics to become data-driven (45% of organizations expect to increase their spending in this area) while the latter is fueled by disruptive technology and the adoption of AI (41% of organizations name it as their game changer).


Kafka Connect Deep Dive – JDBC Source Connector

Confluent

One of the most common integrations that people want to do with Apache Kafka® is getting data in from a database. That is because relational databases are a rich source of events. The existing data in a database, and any changes to that data, can be streamed into a Kafka topic. From there, these events can be used to drive applications, streamed to other data stores such as search replicas or caches, and streamed to storage for analytics.
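As a rough sketch of what such an integration looks like in practice, the snippet below registers a Confluent JDBC source connector with a Kafka Connect worker through its REST API, using only the JDK’s built-in HTTP client. The database URL, credentials, table name, topic prefix, and the localhost:8083 Connect endpoint are assumptions for illustration; the config keys follow the connector’s documented options.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJdbcSource {
    public static void main(String[] args) throws Exception {
        // Connector definition: poll the "orders" table and stream new rows
        // (detected via the incrementing "id" column) into the topic "jdbc-orders".
        String connector = """
            {
              "name": "jdbc-orders-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:postgresql://db.example.com:5432/shop",
                "connection.user": "kafka",
                "connection.password": "secret",
                "table.whitelist": "orders",
                "mode": "incrementing",
                "incrementing.column.name": "id",
                "topic.prefix": "jdbc-",
                "tasks.max": "1"
              }
            }
            """;

        // Submit the connector to a Kafka Connect worker's REST API (assumed at localhost:8083).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connector))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

In incrementing mode the connector only picks up newly inserted rows; timestamp (or timestamp+incrementing) mode is what you would reach for to also capture updates to existing rows, which is the "any changes to that data" case mentioned above.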


How to Make Space for Research & Innovation?

Zalando Engineering

Redesigning research and product development so that the explorative nature of data science becomes a driver for innovation. Zalando leverages cutting-edge machine learning technologies to be Europe’s leading online platform for fashion and lifestyle. In order to develop these products, data scientists and product roles have to work together closely.


All About the Kafka Connect Neo4j Sink Plugin

Confluent

Only a little more than one month after the first release, we are happy to announce another milestone for our Kafka integration. Today, you can grab the Kafka Connect Neo4j Sink from Confluent Hub. As a refresher: we’ve built on the work we did for the Kafka sink Neo4j extension and have made it available via remote connections over our binary Bolt protocol.


How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.


Sysmon Security Event Processing in Real Time with KSQL and HELK

Confluent

During a recent talk titled Hunters ATT&CKing with the Right Data, which I presented with my brother Jose Luis Rodriguez at ATT&CKcon, we talked about the importance of documenting and modeling security event logs before developing any data analytics while preparing for a threat hunting engagement. Defining relationships among Windows security event logs such as Sysmon, for example, helped us to appreciate the extra context that two or more events together can provide for a hunt.


Improving Stream Data Quality with Protobuf Schema Validation

Confluent

The requirements for fast and reliable data pipelines are growing quickly at Deliveroo as the business continues to grow and innovate. We have delivered an event streaming platform which gives strong guarantees on data quality, using Apache Kafka® and Protocol Buffers. Just some of the ways in which we make use of data at Deliveroo include computing optimal rider assignments to in-flight orders, making live operational decisions, personalising restaurant recommendations to users, and prioritising…


Kafka Summit 2019: 3 Big Things!

Confluent

How many Kafka Summits should there be in a year? Experts disagree. Some say there should be one giant event where everybody gathers at once. Some say there should be one once a month in different regions of the world. Others say you should live every day like it’s Kafka Summit. As you may know, we have adopted a happy medium: three Summits in 2019.


Cloudera’s and Hortonworks’ data platform in the cloud named among Leaders in new Forrester Wave

Cloudera

When Cloudera was formed about 10 years ago, the founders believed that companies would jump at the chance to store, manage, and analyze their data in the cloud. Thus, they came up with the name Cloudera, which was a play on “era of cloud.” But, much to their surprise, companies weren’t ready for the cloud; they were more focused on staying on-prem. So, Cloudera focused on helping companies store, manage, and analyze data on-prem.


Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.