Top Data Engineering Digest Data Governance Technology Content for February, 2018

February, 2018

Honeycomb Data Infrastructure with Sam Stokes - Episode 20

Data Engineering Podcast

FEBRUARY 25, 2018

Summary: One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, but it is still a critical component of a successful system. Sam Stokes is an engineer at Honeycomb where he helps to build a platform that is able to capture all of the events and context that occur in our production environments and use them to answer all of your questions abou

Kafka

Kafka AWS Data Data Engineer

Code Migration in Production: Rewriting the Sharding Layer of Uber’s Schemaless Datastore

Uber Engineering

FEBRUARY 22, 2018

In 2014, Uber Engineering built Schemaless, our fault-tolerant and scalable datastore, to facilitate the rapid growth of our company. For context, we deployed more than 40 Schemaless instances and many thousands of storage nodes in 2016 alone. As our …

Coding

Coding Engineering Designing Architecture

Trending Sources

Zalando @ FOSDEM

Zalando Engineering

FEBRUARY 21, 2018

Why FOSDEM is not your average conference

I could get cheeky with semantics and point out that the “M” in FOSDEM stands for “Meeting”. But I’ll play nice and focus instead on the specifics of the event itself. FOSDEM has been running since 2001. In that time, it has grown to become the open source community event for Europe. Over a two-day event, thousands of attendees descend upon the ULB in Brussels to attend what is, in reality, a collection of conferences.

Software Engineer

Software Engineer Software Engineering Database Engineering

Concurrency, MySQL and Node.js: A journey of discovery

nodeSWAT

FEBRUARY 5, 2018

Our story begins, like so many others, with a code-loving protagonist, someone we all can relate to. His days are largely filled with designing code, writing code and reading about code, keeping clients happy while learning and having fun. This has been going on for years now with both MySQL and Node.js, among others, and as such our protagonist considers himself quite proficient with both those technologies.

MySQL

MySQL Database Programming Coding

Recap of Hadoop News for January 2018

ProjectPro

FEBRUARY 1, 2018

News on Hadoop - January 2018

Apache Hadoop 3.0 goes GA, adds hooks for cloud and GPUs. TechTarget.com, January 3, 2018. The latest update to the 11-year-old big data framework, Hadoop 3.0, allows cluster pooling on GPU resources, reduces storage requirements, and adds a novel federation scheme that lets the YARN resource manager and job scheduler expand the number of nodes that can run within a Hadoop cluster.

Hadoop

Hadoop Food Healthcare Cloud Computing

Breaking down data silos: when SAP alone is not enough

Cloudera

FEBRUARY 19, 2018

Running a large company is impossible without having an ERP system in place, and SAP business software remains at the forefront in this category. But when companies are looking towards new technologies such as data lakes, machine learning or predictive analytics, SAP alone is just not enough. To keep up with tech trends, businesses have to face the challenges of integrating SAP with non-SAP technologies and embark on a crusade against data silos.

Data Lake

Data Lake Finance Government Hadoop

Data Teams with Will McGinnis - Episode 19

Data Engineering Podcast

FEBRUARY 18, 2018

Summary: The responsibilities of a data scientist and a data engineer often overlap and occasionally come to cross purposes. Despite these challenges, it is possible for the two roles to work together effectively and produce valuable business outcomes. In this episode Will McGinnis discusses the opinions that he has gained from experience on how data teams can play to their strengths to the benefit of all.

Data Science

Data Science Data Engineer Data Engineering Data Pipeline

More Trending

Building Reliable Reprocessing and Dead Letter Queues with Apache Kafka

Uber Engineering

FEBRUARY 16, 2018

In distributed systems, retries are inevitable. From network errors to replication issues and even outages in downstream dependencies, services operating at a massive scale must be prepared to encounter, identify, and handle failure as gracefully as possible. Given the scope …

Kafka

Kafka Building Engineering Systems

Data Analysis with Spark

Zalando Engineering

FEBRUARY 28, 2018

Apache’s lightning-fast engine for data analysis and machine learning

In recent years, there has been a massive shift in the industry towards data-oriented decision making backed by enormously large data sets. This means that we can serve our customers with more relevant, personalized content. We in the Digital Experience team are tasked with analysing Big Data in order to gather insights and support the product team with the decision making process.

Data Analysis

Data Analysis Hadoop SQL Datasets

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

Data Engineering Podcast

FEBRUARY 11, 2018

Summary: As communications between machines become more commonplace, the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers.

PostgreSQL

PostgreSQL NoSQL Google Cloud MongoDB

Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17

Data Engineering Podcast

FEBRUARY 3, 2018

Summary: One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream-oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both options, in addition to being globally scalable and fast.

Kafka

Kafka Data Pipeline Data Science Data Engineer

2017 – Another Award-Winning Year for Cloudera!

Cloudera

FEBRUARY 16, 2018

In many ways, 2017 was a singular year for Cloudera, not least because we staged a successful IPO and joined the ranks of the world’s fastest-growing, publicly traded companies. We deeply appreciate the vote of confidence and trust our customers have placed in us and are proud of the hard work of our 1,600-plus employees. These are some of the year’s highlights.

Manufacturing

Manufacturing Cloud Computing Healthcare Machine Learning

Cloudera on Cloudera: Our Journey to Becoming more Data-driven

Cloudera

FEBRUARY 14, 2018

I’ve spent the last four years here at Cloudera talking with our customers about how to run their businesses better using their data and Cloudera’s products and services. Now I get to put my money where my mouth is – and turn my focus internally on how we at Cloudera can become more data-driven. We aspire to and are on the journey to be the best-run company on data, and to be our own best reference.

Professional Services

Professional Services Finance Data Cloud

Cybersecurity on Call: Nation-State Cyber Operations with Patrick Tucker

Cloudera

FEBRUARY 6, 2018

As cyber attacks continue to increase across the world, it has become more critical for countries to implement cyber operations from a defensive and offensive perspective to protect national secrets and their citizens. An Arizona State University research paper showed just how global this problem is when they discovered that if hackers discussed a zero-day exploit on the dark web in Chinese the likelihood of a hacker exploiting the vulnerability was 9%.

Government

Government Technology IT Management

Dave Shuman Talks IoT and Big Data on Federal News Radio

Cloudera

FEBRUARY 12, 2018

What exactly can we expect for IoT in 2018, and how can you improve your organization with connected devices? That was the question Dave Shuman set out to answer when he sat down last month with John Gilroy at the Federal News Radio headquarters in Washington, D.C. Federal Tech Talk looks at the world of high technology in the federal government and, as its host, John speaks the language of federal CISOs, CIOs, and CTOs.

Big Data

Big Data Government Manufacturing Data

Innovation in Digital Experience

Zalando Engineering

FEBRUARY 19, 2018

Multi-functional teams make for a greater customer journey

When I started in Zalando Tech, I hadn’t worked with a product manager before, and I had probably never seen a UX designer, a UI designer, a researcher or a business developer before either. My world was data science, more specifically, personalization and recommender systems. In this isolated bubble, data scientists often thought we could solve all problems without help, but in the last two years, I came to understand why we need to sto

Designing

Designing Data Science Management Building

Five Minutes from Machine Learning to RESTful API

Zalando Engineering

FEBRUARY 14, 2018

The benefits of Connexion: Zalando’s open source API-First framework

In this article, I will show how quick and simple it can be to create a RESTful API for a machine learning model using Zalando’s open source Swagger/OpenAPI First framework called Connexion. Official documentation describes Connexion as the following: “Connexion is a framework on top of Flask that automagically handles HTTP requests based on OpenAPI 2.0 Specification (formerly known as Swagger Spec) of your API described in YAM

Machine Learning

Machine Learning Python Coding IT

Cross-Lingual End-to-End Product Search with Deep Learning

Zalando Engineering

FEBRUARY 7, 2018

How We Built the Next Generation Product Search from Scratch using a Deep Neural Network

Product search is one of the key components in an online retail store. A good product search can understand a user’s query in any language, retrieve as many relevant products as possible, and finally present the results as a list in which the preferred products should be at the top, and the less relevant products should be at the bottom.

Deep Learning

Deep Learning Architecture Retail Coding

Crushing AVRO Small Files with Spark

Zalando Engineering

FEBRUARY 5, 2018

Solving the many small files problem for AVRO

The Fashion Content Platform teams in Zalando Dublin handle large amounts of data on a daily basis. To make sense of it all, we utilise Hadoop (EMR) on AWS. Within this post, we discuss a system where a real-time system feeds the data. Due to the variance in data volumes and the period that these systems write to storage, there can be a large number of small files.

Hadoop

Hadoop Amazon Web Services AWS Utilities
