Sat. Oct 09, 2021 - Fri. Oct 15, 2021

How to add tests to your data pipelines

Start Data Engineering

Contents: Introduction; Testing your data pipeline; 1. End-to-end system testing; 2. Data quality testing; 3. Monitoring and alerting; 4. Unit and contract testing; Conclusion; Further reading. Introduction: Testing data pipelines is different from testing other applications, like a website backend.

Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At Acryl Data

Data Engineering Podcast

Summary: The binding element of all data work is the metadata graph that is generated by all of the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to bring order to the scale of LinkedIn’s data needs. It was also designed to be able to work for small scale systems that are just starting to develop in complexity.

What is new in Cloudera Streaming Analytics 1.5?

Cloudera

At the end of May, we released the second version of Cloudera SQL Stream Builder (SSB) as part of Cloudera Streaming Analytics (CSA). Among other features, the 1.4 version of CSA surfaced the expressivity of Flink SQL in SQL Stream Builder by adding DDL and Catalog support, and it greatly improved the integration with other Cloudera Data Platform components, for example by enabling stream enrichment from Hive and Kudu.

Introducing Single Message Transforms and New Connector Features on Confluent Cloud

Confluent

Data silos across an organization are common, with valuable business insights waiting to be uncovered. This is why at Confluent we built a portfolio of fully managed connectors to enable […].

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

10 Skills to Ace Your Data Engineering Interviews

Start Data Engineering

Contents: Introduction; Skills; 1. SQL; 2. Python; 3. Leetcode: data structures and algorithms; 4. Data modeling; 4.1 Data warehousing; 4.2 OLTP; 5. Data pipelines; 6. Distributed system fundamentals; 7. Event streaming; 8. System design; 9. Business questions; 10. Cloud computing; 11. Probabilistic data structures (optional); Interview prep, the TL;DR version; Conclusion. Introduction: Are you a student, analyst, engineer, or someone preparing for a data engineering interview and overwhelmed by all the tools and concepts?

How And Why To Become Data Driven As A Business

Data Engineering Podcast

Summary: Organizations of all sizes are striving to become data driven, starting in earnest with the rise of big data a decade ago. With the never-ending growth in data sources and methods for aggregating and analyzing them, the use of data to direct the business has become a requirement. Randy Bean has been helping enterprise organizations define and execute their data strategies since before the age of big data.

More Trending

Apache Kafka and R: Real-Time Prediction and Model (Re)training

Confluent

Machine learning on real-time data is a powerful combination because you gain direct insights into your data, can make powerful decisions, and consequently improve your business processes and outcomes. It […].

What's the difference between ETL & ELT?

Start Data Engineering

Contents: 1. Introduction; 2. E-T-L definition; 3. Differences between ETL & ELT; 4. Conclusion; 5. Further reading. 1. Introduction: If you are a student, analyst, engineer, or anyone working with data pipelines, you have probably heard of the ETL and ELT architectures. If you have questions like: What is the difference between ETL & ELT? Should I use the ETL or ELT pattern for my data pipeline?
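
As a rough illustration of the ETL/ELT distinction (a minimal sketch, not taken from the linked post): the example below assumes a hypothetical list of raw payment records and uses Python's built-in `sqlite3` as a stand-in for a warehouse. In the ETL path the aggregation happens in application code before loading; in the ELT path the raw rows are loaded first and transformed with SQL inside the database. All table and column names are made up for the example.

```python
import sqlite3

# Hypothetical raw records (customer, amount as text); illustrative values only.
raw_rows = [("alice", "10.50"), ("bob", "3.25"), ("alice", "4.00")]


def run_etl(conn):
    """ETL: transform in application code, then load only the result."""
    totals = {}
    for customer, amount in raw_rows:
        totals[customer] = totals.get(customer, 0.0) + float(amount)
    conn.execute("CREATE TABLE etl_customer_totals (customer TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO etl_customer_totals VALUES (?, ?)", sorted(totals.items())
    )
    return conn.execute(
        "SELECT customer, total FROM etl_customer_totals ORDER BY customer"
    ).fetchall()


def run_elt(conn):
    """ELT: load the raw data as-is, then transform inside the database."""
    conn.execute("CREATE TABLE raw_payments (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_payments VALUES (?, ?)", raw_rows)
    conn.execute(
        "CREATE TABLE elt_customer_totals AS "
        "SELECT customer, SUM(CAST(amount AS REAL)) AS total "
        "FROM raw_payments GROUP BY customer"
    )
    return conn.execute(
        "SELECT customer, total FROM elt_customer_totals ORDER BY customer"
    ).fetchall()


etl_result = run_etl(sqlite3.connect(":memory:"))
elt_result = run_elt(sqlite3.connect(":memory:"))
```

Both paths end with the same totals table; the difference is where the transformation runs and whether the raw data remains available in the target system afterwards.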

CAMBI, a banding artifact detector

Netflix Tech

By Joel Sole, Mariana Afonso, Lukas Krasula, Zhi Li, and Pulkit Tandon. Introducing the banding artifact detector developed by Netflix, aimed at further improving the delivered video quality. Banding artifacts can be pretty annoying. But, first of all, you may wonder, what is a banding artifact? Banding artifact? You are at home enjoying a show on your brand-new TV.

Apache Ozone – A High Performance Object Store for CDP Private Cloud

Cloudera

As organizations wrangle with the explosive growth in data volume they are presented with today, efficiency and scalability of storage become pivotal to operating a successful data platform for driving business insight and value. Apache Ozone is a distributed, scalable, and high performance object store, available with Cloudera Data Platform Private Cloud.

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, and Terrence Sheflin

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Confluent’s Oracle CDC Connector Now Supports Oracle Database 19c

Confluent

Many Oracle Database customers currently still leverage Oracle 12c or 18c in their production environments, with some even using Oracle 11g. Most of these customers have moved to 19c or […].

What are Common Table Expressions (CTEs) and when to use them?

Start Data Engineering

Contents: Introduction; Setup; Common Table Expressions (CTEs); Performance comparison; CTE; Subquery and derived tables; Temp table; Trade-offs; Tear down; Conclusion; References. Introduction: If you are a student, analyst, engineer, or anyone in the data space and are wondering what CTEs are or trying to understand CTE performance, then this post is for you. In this post, we go over what CTEs are and compare their performance to the subquery, derived table, and temp table.
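
To make the comparison concrete, here is a minimal sketch (not from the linked post) that writes the same aggregate once as a CTE and once as a derived table, using Python's built-in `sqlite3`. The `orders` table, its columns, and the threshold of 100 are hypothetical. The sketch only shows that the two forms return the same rows; it says nothing about their relative performance, which depends on the database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and data, for illustration only.
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30), ("alice", 70), ("bob", 20), ("carol", 150)],
)

# CTE form: name the intermediate result up front, then query it.
cte_query = """
WITH customer_totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT customer, total
FROM customer_totals
WHERE total >= 100
ORDER BY customer
"""

# Derived-table form: the same intermediate result inlined in FROM.
derived_table_query = """
SELECT customer, total
FROM (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
) AS customer_totals
WHERE total >= 100
ORDER BY customer
"""

cte_rows = conn.execute(cte_query).fetchall()
derived_rows = conn.execute(derived_table_query).fetchall()
```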

Revisiting BetterTLS: Certificate Path Building

Netflix Tech

By Ian Haken. Last year the AddTrust root certificate expired and lots of clients had a bad time. Some Roku devices weren’t working right, Heroku had problems, and some folks couldn’t even curl. In the aftermath Ryan Sleevi wrote a really great blog post not just about the issue of this one certificate’s expiry, but the problem that so many TLS implementations have in general with certificate path building.

#ClouderaLife Spotlight: Bryan Bottinelli, Commercial Account Executive

Cloudera

As we continue to celebrate Hispanic Heritage Month, we’d like to shine a spotlight on yet another one of Cloudera’s high-performing employees who contributes to the culture and community both inside and outside of the Cloudera walls. Meet Bryan Bottinelli, a two-year Clouderan and first-generation American with roots in Colombia and Chile. As a Commercial Account Manager, he spends his work days growing the adoption of Cloudera Data Platform (CDP) in the Great Lakes region.

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

A Day in the Life of a DataOps Engineer

DataKitchen

A DataOps implementation project consists of three steps. First, you must understand the existing challenges of the data team, including the data architecture and end-to-end toolchain. Second, you must establish a definition of “done.” In DataOps, the definition of done includes more than just some working code. It considers whether a component is deployable, monitorable, maintainable, reusable, secure and adds value to the end-user or customer.

6 Key Concepts to Master Window Functions

Start Data Engineering

Contents: Introduction; Prerequisites; 6 Key Concepts; 1. When to Use; 2. Partition By; 3. Order By; 4. Function; 5. Lead and Lag; 6. Rolling Window; Efficiency Considerations; Conclusion; Further reading; References. Introduction: If you work with data, window functions can significantly level up your SQL skills.
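
Several of the concepts listed above (partition by, order by, lag, and a rolling window) can be shown in one query. The following is a minimal sketch, not taken from the linked post, using Python's built-in `sqlite3`; it assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added. The `sales` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical daily sales per store, for illustration only.
conn.execute("CREATE TABLE sales (store TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("a", 1, 10), ("a", 2, 20), ("a", 3, 30), ("b", 1, 5), ("b", 2, 15)],
)

query = """
SELECT
    store,
    day,
    amount,
    -- Previous day's amount within the same store (NULL on the first day).
    LAG(amount) OVER (PARTITION BY store ORDER BY day) AS prev_amount,
    -- Rolling sum over the current row and the one before it.
    SUM(amount) OVER (
        PARTITION BY store
        ORDER BY day
        ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
    ) AS rolling_sum
FROM sales
ORDER BY store, day
"""

rows = conn.execute(query).fetchall()
```

Each row keeps its own detail columns while gaining values computed across its partition, which is the main difference from a `GROUP BY` aggregate.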

The Data Janitor Letters - September 2021

Pipeline Data Engineering

Data engineering salon: news and interesting reads about the world of data. Cloudflare’s Disruption (Ben Thompson, Stratechery): S3’s margin is R2’s opportunity. Operations is not Developer IT (Mathew Duggan, DevOps Manager, GAN Integrity): It's not their fault, they were told this was easy. How Big Tech Runs Tech Projects and the Curious Absence of Scrum (Gergely Orosz): A survey of how tech projects run across the industry highlights Scrum being absent from Big Tech.

Your Parents Still Don’t Know What a Hashtag Is. Let’s Teach Them the Basics of Machine Learning and Streaming Data

Cloudera

Quite often, the digital natives of the family (that is, you) have to explain to the analog fans of the family what PDFs are, how to use a hashtag, a phone camera, or a remote. Imagine if you had to explain what machine learning is and how to use it. There’s no need to panic. Cloudera produced a series of ebooks to help simplify even the most complex tech topics: Production Machine Learning For Dummies, Apache NiFi For Dummies, and Apache Flink For Dummies (coming soon).

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases, the number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.

What Is a Cloud Database? IaaS, PaaS, SaaS and DBaaS Explained

Rockset

For many organizations, the advantages of a cloud-based database are clear. They offer scalability, security, and availability. There can also be cost savings over custom and on-premises database solutions. However, not all cloud databases are created equal. Terms like IaaS, PaaS and SaaS have traditionally been used to describe various levels of cloud computing, but how do they apply to cloud databases?

6 Responsibilities of a Data Engineer

Start Data Engineering

Contents: Introduction; Responsibilities of a data engineer; 1. Move data between systems; 2. Manage data warehouse; 3. Schedule, execute, and monitor data pipelines; 4. Serve data to the end-users; 5. Data strategy for the company; 6. Deploy ML models to production; Conclusion; Further reading. Introduction: Data engineering is a relatively new field, and as such, there is a huge variance in the actual job responsibilities across different companies.

NextJS and Managing Your Data

Grouparoo

Managing data in a front-end framework is an infinitely solved problem. Every framework has its own flavor, architecture, and opinions on how state should flow; NextJS does not. It provides a variety of methods for data fetching but offers no built-in patterns for data management. On one hand, that doesn't seem unreasonable; NextJS simply expands on React.

October 2021 dbt Update: Metrics and Hat Tricks

dbt Developer Hub

Hello there. While I have a lot of fun things to share this month, I can't start with anything other than this. Yep, it's official: dbt will support metric definitions. With this feature, you'll be able to centrally define rules for aggregating metrics (think, "active users" or "MRR") in version controlled, tested, documented dbt project code. We still have a ways to go, but in the future, you'll be able to explore these metrics in the BI and analytics tools that you know and love.

The Ultimate Guide To Data-Driven Construction: Optimize Projects, Reduce Risks, & Boost Innovation

Speaker: Donna Laquidara-Carr, PhD, LEED AP, Industry Insights Research Director at Dodge Construction Network

In today’s construction market, owners, construction managers, and contractors must navigate increasing challenges, from cost management to project delays. Fortunately, digital tools now offer valuable insights to help mitigate these risks. However, the sheer volume of tools and the complexity of leveraging their data effectively can be daunting. That’s where data-driven construction comes in.

15 Machine Learning Regression Project Ideas for Beginners

ProjectPro

Linear and logistic regression models in machine learning mark most beginners’ first steps into the world of machine learning. Whether you want to understand the effect of IQ and education on earnings or analyze how smoking cigarettes and drinking coffee are related to mortality, all you need is to understand the concepts of linear and logistic regression.

Tracing SRE’s journey in Zalando - Part III

Zalando Engineering

This is the third and last part of our journey to roll out SRE in Zalando. You’ll find the previous chapters here and here. Thanks for following our story. 2020 - From team to department. The road so far: 2016 saw an attempt at the rollout of a Site Reliability Engineering (SRE) organization that did not quite materialize but still left the seed of SRE in the company; in 2018 and 2019 we had a single SRE team working on strategic projects that improved the reliability of Zalando’s platform.

Operational data lineage with dbt

Datakin

dbt is an amazing way to transform data within a data warehouse. So amazing, in fact, that it’s easy to end up doing tons and tons of transformations on all kinds of datasets. After a while, it can become an unnavigable collection of overlapping tables. That’s a problem when it comes time to troubleshoot. If you use Datakin to observe your dbt models as they run, you can always know exactly where your datasets came from and how they were created.

Firm Foundations Needed for 5G Exploration

Teradata

Telcos, their customers, & a range of enterprises are entering a period of experimentation with 5G. The opportunities for innovation & growth are immense – but the costs & risks are outsized too.

Business Intelligence 101: How To Make The Best Solution Decision For Your Organization

Speaker: Evelyn Chou

Choosing the right business intelligence (BI) platform can feel like navigating a maze of features, promises, and technical jargon. With so many options available, how can you ensure you’re making the right decision for your organization’s unique needs? 🤔 This webinar brings together expert insights to break down the complexities of BI solution vetting.

Building a Platform for Content Curation

Afterpay Tech

By Tony Tamplin. After years of growth and development on evolving products, Afterpay decided it was time to apply the knowledge accumulated and create a consistent and focused direction across all products. One part of that plan involved rebuilding the website from the ground up to provide features and performance that Afterpay’s users deserve.

Announcing O’Reilly’s Data Quality Fundamentals

Monte Carlo

On behalf of the entire company, I’m excited to announce the release of Data Quality Fundamentals: A Practitioner’s Guide to Building More Trustworthy Data Pipelines, published by O’Reilly Media and available for free on the Monte Carlo website. This is the first book published by O’Reilly to educate the market on how best-in-class data teams design and architect technical systems to achieve trustworthy and reliable data at scale.

Tuning Image Classifiers using Human-In-The-Loop

Zalando Engineering

In this blog post we describe an algorithm we developed when building our product image analysis infrastructure, where we use human-in-the-loop to tune the thresholds of our image classifiers. We discuss the algorithm in the following, and present some mathematical details and a simple code example in the appendices. Background: When a customer browses for a product on the Zalando website, they may use descriptive terms to search for what they want; for example, a customer may use a specific term s

Databricks Execution Plans

Advancing Analytics: Data Engineering

Execution plans in Databricks allow you to understand how code will actually get executed across a cluster and are useful for optimising queries. Databricks translates operations into optimized logical and physical plans and shows what operations are going to be executed and sent to the Spark Executors. Execution Flow: Databricks uses the Catalyst optimizer, which automatically discovers the most efficient plan to execute the operations specified.

Driving Responsible Innovation: How to Navigate AI Governance & Data Privacy

Speaker: Aindra Misra, Senior Manager, Product Management (Data, ML, and Cloud Infrastructure) at BILL

Join us for an insightful webinar that explores the critical intersection of data privacy and AI governance. In today’s rapidly evolving tech landscape, building robust governance frameworks is essential to fostering innovation while staying compliant with regulations. Our expert speaker, Aindra Misra, will guide you through best practices for ensuring data protection while leveraging AI capabilities.