It’s official: Apache Kafka® 2.3 has been released! Here is a selection of some of the most interesting and important features we added in the new release. Core Kafka. KIP-351 and KIP-427: improved monitoring for partitions that have lost replicas. To keep your data safe, Kafka creates several replicas of it on different brokers. Kafka will not allow writes to proceed unless the partition has a minimum number of in-sync replicas.
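As a rough illustration of the guarantee those replication settings give you (a sketch using the confluent-kafka Python client; the broker address, topic name, and counts are placeholders), a topic created with a replication factor of 3 and min.insync.replicas=2 will reject acks='all' writes whenever fewer than two replicas are in sync:

```python
# Sketch only: broker address, topic name, and replica counts are illustrative placeholders.
from confluent_kafka.admin import AdminClient, NewTopic
from confluent_kafka import Producer

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Three replicas, and at least two of them must be in sync for a write to be accepted.
topic = NewTopic(
    "orders",
    num_partitions=3,
    replication_factor=3,
    config={"min.insync.replicas": "2"},
)
for future in admin.create_topics([topic]).values():
    future.result()  # block until the topic is actually created

# acks='all' makes the producer wait for the full in-sync replica set;
# if the partition falls below min.insync.replicas, the broker rejects the write.
producer = Producer({"bootstrap.servers": "localhost:9092", "acks": "all"})
producer.produce("orders", key=b"order-1", value=b'{"amount": 42}')
producer.flush()
```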
By Benoit Rostykus, Gabriel Hartmann. Noisy Neighbors. We’ve all had noisy neighbors at one point in our lives. Whether it’s at a cafe or through the wall of an apartment, it is always disruptive. The need for good manners in shared spaces turns out to be important not just for people, but for your Docker containers too. When you’re running in the cloud, your containers are in a shared space; in particular, they share the CPU memory hierarchy of the host instance.
Summary: Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly ordered based on dependency graphs. Data science platforms provide an environment that is conducive to rapid experimentation and iteration, with data flowing directly between stages.
In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples for debugging Airflow DAGs. You’ll learn how to: create a standardized process for debugging to quickly diagnose errors in your DAGs; identify common issues with DAGs, tasks, and connections; and distinguish between Airflow-related…
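One standardized local debugging step worth building into that process can be sketched as follows, assuming a recent Airflow 2.x release; the DAG id, task names, and logic are invented for illustration. Running the DAG with `dag.test()` executes every task in a single local process, so stack traces and logs land directly in your terminal instead of the scheduler logs.

```python
# Minimal sketch: DAG id, task names, and logic are illustrative only.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def debug_example():
    @task
    def extract():
        return [1, 2, 3]

    @task
    def transform(rows):
        # A deliberate place to break: exceptions surface directly in the terminal
        # when the DAG runs via dag.test() instead of the scheduler.
        return [r * 2 for r in rows]

    transform(extract())

dag_object = debug_example()

if __name__ == "__main__":
    # Runs all tasks in one local process, which is convenient for step-through debugging.
    dag_object.test()
```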
Cloudera Unveils Industry’s First Enterprise Data Cloud in Webinar. How do you take a mission-critical on-premises workload and rapidly burst it to the cloud? Can you instantly auto-scale resources as demand requires and just as easily pause your work so you don’t run up your cloud bill? On June 18th, Cloudera provided an exclusive preview of these capabilities, and more, with the introduction of Cloudera Data Platform (CDP), the industry’s first enterprise data cloud.
Ah, the ETL (Extract-Transform-Load) Window, the schedule by which the Business Intelligence developer sets their clock, the nail-biting nightly period during which the on-call support hopes their phone won’t ring. It’s a cornerstone of the data warehousing approach… and we shouldn’t have one. There, I said it. Hear me out – back in the on-premises days we had data loading processes that connected directly to our source system databases and performed huge data extract queries as the start of one long…
Microservices have a symbiotic relationship with domain-driven design (DDD)—a design approach where the business domain is carefully modeled in software and evolved over time, independently of the plumbing that makes the system work. I see this pattern coming up more and more in the field in conjunction with Apache Kafka®. In these projects, microservice architectures use Kafka as an event streaming platform.
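A common shape this takes can be sketched as follows; the topic name, event fields, and broker address are assumptions made up for the example. The service models a domain event explicitly and publishes it to Kafka, so other services react to the event stream instead of being called synchronously.

```python
# Illustrative sketch: topic name, event fields, and broker address are assumptions.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from confluent_kafka import Producer

@dataclass
class CustomerRegistered:
    # A domain event from the bounded context, independent of transport details.
    customer_id: str
    email: str
    occurred_at: str

producer = Producer({"bootstrap.servers": "localhost:9092"})

event = CustomerRegistered(
    customer_id="c-123",
    email="jane@example.com",
    occurred_at=datetime.now(timezone.utc).isoformat(),
)

# Downstream microservices consume the event stream rather than being invoked directly.
producer.produce(
    "customer-events",
    key=event.customer_id.encode(),
    value=json.dumps(asdict(event)).encode(),
)
producer.flush()
```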
Rob Armstrong discusses the challenges of moving from a departmental solution to operational and production systems working at scale, and how Teradata Vantage can solve for them.
Netflix Studio Hack Day – May 2019. By Tom Richards, Carenina Garcia Motion, and Marlee Tart. Hack Days are a big deal at Netflix. They’re a chance to bring together employees from all our different disciplines to explore new ideas and experiment with emerging technologies. For the most recent hack day, we channeled our creative energy towards our studio efforts.
Summary: Building and maintaining a data lake is a choose-your-own-adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allows for generating significant value, but it can also lead to anti-patterns and inconsistent quality in your analytics. Delta Lake is an open source, opinionated framework built on top of Spark for interacting with and maintaining data lake platforms that incorporates the lessons learned at Databricks from countless…
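As a rough sketch of what that opinionated layer looks like in practice (assuming the delta-spark package is installed; the paths, schema, and data are invented), Delta Lake writes are transactional and earlier table versions remain readable via time travel:

```python
# Sketch only: file paths, schema, and data are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-sketch")
    # Extension and catalog settings needed to enable Delta Lake on a plain Spark session.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Writes are transactional; concurrent readers never see a half-written table.
df.write.format("delta").mode("overwrite").save("/tmp/users_delta")

# Read the current version, or time-travel to an earlier one.
current = spark.read.format("delta").load("/tmp/users_delta")
first_version = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/users_delta")
current.show()
```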
Noisy Neighbors in Large, Multi-Tenant Clusters. The typical Cloudera Enterprise Data Hub Cluster starts with a few dozen nodes in the customer’s datacenter hosting a variety of distributed services. Once configured and secured, the cluster administrator (admin) gives access to a few individuals to onboard their workloads. Over time, workloads start processing more data, tenants start onboarding more workloads, and administrators (admins) start onboarding more tenants.
Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.
Why build a new SQL development environment? We love SQL — our mission is to bring fast, real-time queries to messy, semi-structured real-world data and SQL is a core part of our effort. A SQL API allows our product to fit neatly into the stacks of our users without any workflow re-architecting. Our users can easily integrate Rockset with a multitude of existing tools for SQL development (e.g.…
Confluent’s clients for Apache Kafka® recently passed a major milestone—the release of version 1.0. This has been a long time in the making. Magnus Edenhill first started developing librdkafka about seven years ago, later joining Confluent in the very early days to help foster the community of Kafka users outside the Java ecosystem. Since then, the clients team has been on a mission to build a set of high-quality librdkafka bindings for different languages (initially Python, Go, and .NET)…
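A minimal sketch of what those bindings look like from Python is below; confluent-kafka wraps librdkafka, and the broker address, consumer group, and topic name are placeholders.

```python
# Sketch only: connection details and topic name are placeholders.
from confluent_kafka import Consumer, KafkaError

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "example-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)   # librdkafka handles batching, retries, and rebalancing
        if msg is None:
            continue
        if msg.error():
            if msg.error().code() == KafkaError._PARTITION_EOF:
                continue  # reached the end of a partition, keep polling
            raise RuntimeError(msg.error())
        print(msg.key(), msg.value())
finally:
    consumer.close()
```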
Oxford and Teradata partner to modernize analytics for academic research, shape new bodies of research and find answers to pressing business challenges.
Data has value – I think we’ve finally got to the point where most people agree on this. The problem we face is how long it takes to unlock that value, and it’s a frustration that most businesses I speak to are having. Let’s think about why this is. After the horror that was the “data silo” days, with clumps of data living in Access databases, Excel spreadsheets and isolated data stores, we’ve had a pretty good run with the classic Kimball data warehouse.
Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage
There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.
Summary: Building a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are able to get it into production. In this episode Stepan Pushkarev, founder of Hydrosphere, explains why deploying and maintaining machine learning projects in production is different from regular software projects, and the challenges that this brings.
You might think that data collection in astronomy consists of a lone astronomer pointing a telescope at a single object in a static sky. While that may be true in some cases (I collected the data for my Ph.D. thesis this way), the field of astronomy is rapidly changing into a data-intensive science with real-time needs. Each night, large-scale astronomical telescope surveys detect millions of changing objects in the sky and need to stream results to scientists for time-sensitive, complementary…
Summary: Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components onto a paid service.
Software projects of all sizes and complexities have a common challenge: building a scalable solution for search. Who has never seen an application use RDBMS SQL statements to run searches? You might be wondering, is this a good solution? As the databases professor at my university used to say, it depends. Using SQL to run your search might be enough for your use case, but as your project requirements grow and more advanced features are needed—for example, enabling synonyms, multilingual search…
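The simple baseline that "it depends" refers to might look like the sketch below, using an in-memory SQLite table with an invented schema and rows: a plain LIKE query works until you need relevance ranking, stemming, synonyms, or language-aware matching.

```python
# Illustrative sketch: schema and rows are invented for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO products (name, description) VALUES (?, ?)",
    [
        ("Trail Runner", "Lightweight running shoe for rough terrain"),
        ("City Sneaker", "Everyday sneaker with recycled materials"),
    ],
)

# Plain substring matching: no relevance ranking, no stemming, and no synonym
# or multilingual support -- exactly where a dedicated search engine starts to help.
term = "%running%"
rows = conn.execute(
    "SELECT name FROM products WHERE name LIKE ? OR description LIKE ?",
    (term, term),
).fetchall()
print(rows)
```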
Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives
Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-driven…
Confluent Cloud, a fully managed, cloud-native event streaming service that extends the value of Apache Kafka®, is simple, resilient, secure, and performant, allowing you to focus on what is important—building contextual event-driven applications, not infrastructure. If you are using Confluent Cloud as your managed Apache Kafka cluster, you probably also want to start using other Confluent Platform components like the Confluent Schema Registry, Kafka Connect, KSQL, and Confluent REST Proxy.
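Connecting a client to a Confluent Cloud cluster usually comes down to a small block of security configuration; the sketch below shows the producer side, with the bootstrap endpoint, API key, secret, and topic name as placeholders you would replace with values from your cluster settings.

```python
# Sketch only: endpoint, credentials, and topic name are placeholders, not real values.
from confluent_kafka import Producer

conf = {
    "bootstrap.servers": "pkc-xxxxx.us-east-1.aws.confluent.cloud:9092",
    "security.protocol": "SASL_SSL",      # Confluent Cloud requires TLS plus SASL auth
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
}

producer = Producer(conf)
producer.produce("pageviews", key=b"user-1", value=b'{"page": "/home"}')
producer.flush()
```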
At TokenAnalyst, we are building the core infrastructure to integrate, clean, and analyze blockchain data. Data on a blockchain is also known as on-chain data. We offer both historical and low-latency data streams of on-chain data across multiple blockchains. How we use Apache Kafka and the Confluent Platform: Apache Kafka® is the central data hub of our company.
For event streaming application developers, it is important to continuously update the streaming pipeline based on the need for changes in the individual applications in the pipeline. It is also important to understand some of the common streaming topologies that streaming developers use to build an event streaming pipeline. Here in part 4 of the Spring for Apache Kafka Deep Dive blog series, we will cover: Common event streaming topology patterns supported in Spring Cloud Data Flow.
Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.
Older Teradata analytics software versions may not support the latest Vantage innovations and could cost you more than upgrading. Learn more.
Teradata Vantage is busting through analytic silos and raising the bar. Find out what drove these innovations and led to Vantage becoming our most popular release yet.
Find out how Swedbank has partnered with Teradata to illuminate the customer journey, delivering answers to the business and a superior customer experience.
Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?
Analytics as a service lets you offload IT infrastructure tasks so you can focus on solving your toughest business problems. Learn more about options for Teradata Vantage.
Hey everyone, Advancing Analytics are heading to Seattle in November for the PASS Summit. We will be delivering a full training day on Azure Databricks - Practical Azure Databricks: Engineering & Warehousing at Scale. The session will focus on using Azure Databricks for Modern Data Warehousing. Not sure if the day is for you? Well, take a look at the video we recorded.
In this blog post, I'll describe how we use RocksDB at Rockset and how we tuned it to get the most performance out of it. I assume that the reader is generally familiar with how log-structured merge-tree (LSM) based storage engines like RocksDB work. At Rockset, we want our users to be able to continuously ingest their data into Rockset with sub-second write latency and query it in tens of milliseconds.
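As a loose illustration of the kind of knobs such tuning involves (a sketch using the python-rocksdb bindings; the option values are arbitrary examples, not Rockset's actual settings), write-heavy workloads typically start by sizing the memtables that absorb writes before they are flushed into SST files:

```python
# Sketch only: option values are arbitrary examples, not recommendations.
import rocksdb

opts = rocksdb.Options()
opts.create_if_missing = True
opts.write_buffer_size = 64 * 1024 * 1024      # size of each in-memory memtable
opts.max_write_buffer_number = 4               # how many memtables may accumulate before writes stall
opts.target_file_size_base = 64 * 1024 * 1024  # size of SST files produced by compaction

db = rocksdb.DB("example.db", opts)

# Writes land in the memtable (plus the write-ahead log) and are served from memory
# until flushed, which is what keeps write latency low in an LSM-tree engine.
db.put(b"user:1", b'{"name": "alice"}')
print(db.get(b"user:1"))
```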
With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to; write DAGs that adapt to your data at runtime and set up alerts and notifications; scale your…
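A minimal sketch of those building blocks (assuming a recent Airflow 2.x release; the DAG id, schedule, and commands are invented): tasks are the nodes, dependencies are the edges, and the schedule tells Airflow when each run should start.

```python
# Sketch only: DAG id, schedule, and commands are invented for illustration.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_report",
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",   # run every day at 06:00
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Dependencies are the edges of the DAG: extract runs first, then transform, then load.
    extract >> transform >> load
```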