Sat. Oct 09, 2021 - Fri. Oct 15, 2021

How to add tests to your data pipelines

Start Data Engineering

Contents: Introduction; Testing your data pipeline (1. End-to-end system testing, 2. Data quality testing, 3. Monitoring and alerting, 4. Unit and contract testing); Conclusion; Further reading. Introduction: Testing data pipelines is different from testing other applications, such as a website backend.

What is new in Cloudera Streaming Analytics 1.5?

Cloudera

At the end of May, we released the second version of Cloudera SQL Stream Builder (SSB) as part of Cloudera Streaming Analytics (CSA). Among other features, the 1.4 version of CSA surfaced the expressivity of Flink SQL in SQL Stream Builder by adding DDL and Catalog support, and it greatly improved the integration with other Cloudera Data Platform components, for example by enabling stream enrichment from Hive and Kudu.

Java 120

Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At Acryl Data

Data Engineering Podcast

Summary: The binding element of all data work is the metadata graph that is generated by all of the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to bring order to the scale of LinkedIn’s data needs. It was also designed to work for small-scale systems that are just starting to develop in complexity.

Metadata 100

Introducing Single Message Transforms and New Connector Features on Confluent Cloud

Confluent

Data silos across an organization are common, with valuable business insights waiting to be uncovered. This is why at Confluent we built a portfolio of fully managed connectors to enable […].

What's the difference between ETL & ELT?

Start Data Engineering

Contents: 1. Introduction; 2. E-T-L definition; 3. Differences between ETL & ELT; 4. Conclusion; 5. Further reading. Introduction: If you are a student, analyst, engineer, or anyone working with data pipelines, you have probably heard of the ETL and ELT architectures. If you have questions like "What is the difference between ETL & ELT?" or "Should I use the ETL or ELT pattern for my data pipeline?" […]

CDP Public Cloud Regional Control Plane is Now Available in Australia and Europe

Cloudera

We’re excited to announce the availability of CDP Public Cloud Regional Control Plane in Australia and Europe. This addition will extend CDP Hybrid capabilities to customers in industries with strict data protection requirements by allowing them to govern their data entirely in-region. CDP’s public cloud architecture is designed to ensure that customer data remains within a customer’s environment at all times, helping enable companies to meet their data protection obligations, including any rest

Cloud 97

More Trending

Apache Kafka and R: Real-Time Prediction and Model (Re)training

Confluent

Machine learning on real-time data is a powerful combination because you gain direct insights into your data, can make powerful decisions, and consequently improve your business processes and outcomes. It […].

Kafka 90

What are Common Table Expressions (CTEs) and when to use them?

Start Data Engineering

Contents: Introduction; Setup; Common Table Expressions (CTEs); Performance comparison; CTE; Subquery and derived tables; Temp table; Trade-offs; Tear down; Conclusion; References. Introduction: If you are a student, analyst, engineer, or anyone in the data space and are wondering what CTEs are or trying to understand CTE performance, then this post is for you. In this post, we go over what CTEs are and compare their performance to the subquery, derived table, and temp table.

Apache Ozone – A High Performance Object Store for CDP Private Cloud

Cloudera

As organizations wrangle with the explosive growth in data volume they are presented with today, efficiency and scalability of storage become pivotal to operating a successful data platform for driving business insight and value. Apache Ozone is a distributed, scalable, and high performance object store, available with Cloudera Data Platform Private Cloud.

Cloud 92

CAMBI, a banding artifact detector

Netflix Tech

By Joel Sole, Mariana Afonso, Lukas Krasula, Zhi Li, and Pulkit Tandon. Introducing the banding artifacts detector developed by Netflix, which aims to further improve the delivered video quality. Banding artifacts can be pretty annoying. But, first of all, you may wonder, what is a banding artifact? Banding artifact? You are at home enjoying a show on your brand-new TV.

Confluent’s Oracle CDC Connector Now Supports Oracle Database 19c

Confluent

Many Oracle Database customers currently still leverage Oracle 12c or 18c in their production environments, with some even using Oracle 11g. Most of these customers have moved to 19c or […].

6 Key Concepts to Master Window Functions

Start Data Engineering

Contents: Introduction; Prerequisites; 6 Key Concepts (1. When to Use, 2. Partition By, 3. Order By, 4. Function, 5. Lead and Lag, 6. Rolling Window); Efficiency Considerations; Conclusion; Further reading; References. Introduction: If you work with data, window functions can significantly level up your SQL skills.

SQL 130

#ClouderaLife Spotlight: Bryan Bottinelli, Commercial Account Executive

Cloudera

As we continue to celebrate Hispanic Heritage Month, we’d like to shine a spotlight on yet another one of Cloudera’s high-performing employees who contributes to the culture and community both inside and outside the Cloudera walls. Meet Bryan Bottinelli, a two-year Clouderan and first-generation American with roots in Colombia and Chile. As a Commercial Account Manager, he spends his workdays growing the adoption of Cloudera Data Platform (CDP) in the Great Lakes region.

Revisiting BetterTLS: Certificate Path Building

Netflix Tech

By Ian Haken. Last year the AddTrust root certificate expired and lots of clients had a bad time. Some Roku devices weren’t working right, Heroku had problems, and some folks couldn’t even curl. In the aftermath, Ryan Sleevi wrote a really great blog post not just about the issue of this one certificate’s expiry, but about the problem that so many TLS implementations have in general with certificate path building.

A Day in the Life of a DataOps Engineer

DataKitchen

A DataOps implementation project consists of three steps. First, you must understand the existing challenges of the data team, including the data architecture and end-to-end toolchain. Second, you must establish a definition of “done.” In DataOps, the definition of done includes more than just some working code. It considers whether a component is deployable, monitorable, maintainable, reusable, secure and adds value to the end-user or customer.

6 Responsibilities of a Data Engineer

Start Data Engineering

Contents: Introduction; Responsibilities of a data engineer (1. Move data between systems, 2. Manage data warehouse, 3. Schedule, execute, and monitor data pipelines, 4. Serve data to the end-users, 5. Data strategy for the company, 6. Deploy ML models to production); Conclusion; Further reading. Introduction: Data engineering is a relatively new field, and as such, there is huge variance in the actual job responsibilities across different companies.

Your Parents Still Don’t Know What a Hashtag Is. Let’s Teach Them the Basics of Machine Learning and Streaming Data

Cloudera

Quite often, the digital natives of the family (you) have to explain to the analog fans of the family what PDFs are, how to use a hashtag, a phone camera, or a remote. Imagine if you had to explain what machine learning is and how to use it. There’s no need to panic. Cloudera produced a series of ebooks (Production Machine Learning For Dummies, Apache NiFi For Dummies, and Apache Flink For Dummies, coming soon) to help simplify even the most complex tech topics.

The Data Janitor Letters - September 2021

Pipeline Data Engineering

Data engineering salon. News and interesting reads about the world of data.
Cloudflare’s Disruption (Ben Thompson, Stratechery): S3’s margin is R2’s opportunity.
Operations is not Developer IT (Mathew Duggan, DevOps Manager, GAN Integrity): It's not their fault, they were told this was easy.
How Big Tech Runs Tech Projects and the Curious Absence of Scrum (Gergely Orosz): A survey of how tech projects run across the industry highlights Scrum being absent from Big Tech.

15 Machine Learning Regression Projects Ideas for Beginners

ProjectPro

Linear and logistic regression models in machine learning mark most beginners’ first steps into the world of machine learning. Whether you want to understand the effect of IQ and education on earnings or analyze how smoking cigarettes and drinking coffee are related to mortality, all you need is to understand the concepts of linear and logistic regression.

10 Skills to Ace Your Data Engineering Interviews

Start Data Engineering

Contents: Introduction; Skills (1. SQL, 2. Python, 3. Leetcode: data structures and algorithms, 4. Data modeling (4.1 Data warehousing, 4.2 OLTP), 5. Data pipelines, 6. Distributed system fundamentals, 7. Event streaming, 8. System design, 9. Business questions, 10. Cloud computing, 11. Probabilistic data structures (optional)); Interview prep, the TL;DR version; Conclusion. Introduction: Are you a student, analyst, engineer, or someone preparing for a data engineering interview and overwhelmed by all the tools and concepts?

What Is a Cloud Database? IaaS, PaaS, SaaS and DBaaS Explained

Rockset

For many organizations, the advantages of a cloud-based database are clear. Cloud databases offer scalability, security, and availability. There can also be cost savings over custom and on-premises database solutions. However, not all cloud databases are created equal. Terms like IaaS, PaaS and SaaS have traditionally been used to describe various levels of cloud computing, but how do they apply to cloud databases?

NextJS and Managing Your Data

Grouparoo

Managing data in a front-end framework is an infinitely solved problem. Every framework has its own flavor, architecture, and opinions on how state should flow; NextJS does not. It provides a variety of methods for data fetching but offers no built-in patterns for data management. On one hand, it doesn't seem unreasonable; NextJS simply expands on React.

Tracing SRE’s journey in Zalando - Part III

Zalando Engineering

This is the third and last part of our journey to roll out SRE in Zalando. You’ll find the previous chapters here and here. Thanks for following our story. 2020 - From team to department. The road so far: 2016 saw an attempt at the rollout of a Site Reliability Engineering (SRE) organization that did not quite materialize but still left the seed of SRE in the company; in 2018 and 2019 we had a single SRE team working on strategic projects that improved the reliability of Zalando’s platform.

October 2021 dbt Update: Metrics and Hat Tricks

dbt Developer Hub

Hello there, While I have a lot of fun things to share this month, I can't start with anything other than this: Yep, it's official: dbt will support metric definitions. With this feature, you'll be able to centrally define rules for aggregating metrics (think, "active users" or "MRR") in version controlled, tested, documented dbt project code. We still have a ways to go, but in future, you'll be able to explore these metrics in the BI and analytics tools that you know and love.

Operational data lineage with dbt

Datakin

dbt is an amazing way to transform data within a data warehouse. So amazing, in fact, that it’s easy to end up doing tons and tons of transformations on all kinds of datasets. After a while, it can become an unnavigable collection of overlapping tables. That’s a problem when it comes time to troubleshoot. If you use Datakin to observe your dbt models as they run, you can always know exactly where your datasets came from and how they were created.

Building a Platform for Content Curation

Afterpay Tech

By Tony Tamplin. After years of growth and development on evolving products, Afterpay decided it was time to apply the knowledge accumulated and create a consistent and focused direction across all products. One part of that plan involved rebuilding the website from the ground up to provide features and performance that Afterpay’s users deserve.

Announcing O’Reilly’s Data Quality Fundamentals

Monte Carlo

On behalf of the entire company, I’m excited to announce the release of Data Quality Fundamentals: A Practitioner’s Guide to Building More Trustworthy Data Pipelines, published by O’Reilly Media and available for free on the Monte Carlo website. This is the first book published by O’Reilly to educate the market on how best-in-class data teams design and architect technical systems to achieve trustworthy and reliable data at scale.

Firm Foundations Needed for 5G Exploration

Teradata

Telcos, their customers, & a range of enterprises are entering a period of experimentation with 5G. The opportunities for innovation & growth are immense – but the costs & risks are outsized too.


Tuning Image Classifiers using Human-In-The-Loop

Zalando Engineering

In this blog post we describe an algorithm we developed when building our product image analysis infrastructure, where we use human-in-the-loop to tune the thresholds of our image classifiers. We discuss the algorithm below and present some mathematical details and a simple code example in the appendices. Background: When a customer browses for a product on the Zalando website they may use descriptive terms to search for what they want; for example, a customer may use a specific term s

Databricks Execution Plans

Advancing Analytics: Data Engineering

Execution plans in Databricks allow you to understand how code will actually be executed across a cluster and are useful for optimising queries. They show how operations are translated into optimized logical and physical plans and which operations will be executed and sent to the Spark Executors. Execution Flow: Databricks uses the Catalyst optimizer, which automatically discovers the most efficient plan to execute the operations specified.
