1. Introduction 2. What is a staging area 3. The advantages of having a staging area 5. Conclusion 6. Further reading 1. Introduction If you work with data pipelines, you have probably noticed that most of them include a staging area. If you work in the data space, you may have questions like: Why is there a staging area? Can’t we just load data directly into the destination tables?
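The question the teaser raises — why not load straight into destination tables — can be illustrated with a minimal sketch, assuming a SQLite database and invented table names: raw rows land in a staging table first, are validated there, and only clean rows are promoted to the final table.

```python
import sqlite3

# Hypothetical staging pattern: stage raw rows, validate, then promote.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")

raw_rows = [(1, 19.99), (2, None), (3, 5.00)]  # one bad row with a NULL amount
conn.executemany("INSERT INTO stg_orders VALUES (?, ?)", raw_rows)

# Promote only rows that pass validation; bad rows stay in staging for inspection.
conn.execute(
    "INSERT INTO orders SELECT order_id, amount FROM stg_orders WHERE amount IS NOT NULL"
)
loaded = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(loaded)  # 2 of 3 rows promoted; the bad row is still in stg_orders
```

Loading directly into `orders` would instead force you to choose between rejecting the whole batch or silently dropping bad records.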
Introduction. In this blog, I will demonstrate the value of Cloudera DataFlow (CDF), the edge-to-cloud streaming data platform available on the Cloudera Data Platform (CDP), as a data integration and democratization fabric. Within the context of a data mesh architecture, I will present industry settings and use cases where this architecture is relevant, and highlight the business value it delivers across business and technology areas.
Today, an organization’s strategic objective is to deliver innovations for a connected life and to improve the quality of life worldwide. With connected devices comes data, and with data comes […].
Martin Tingley with Wenjing Zheng, Simon Ejdemyr, Stephanie Lane, and Colin McFarland This is the third post in a multi-part series on how Netflix uses A/B tests to inform decisions and continuously innovate on our products. Need to catch up? Have a look at Part 1 (Decision Making at Netflix) and Part 2 (What is an A/B Test?). Subsequent posts will go into more details on experimentation across Netflix, how Netflix has invested in infrastructure to support and scale experimentation, and the i
Speaker: Jason Chester, Director, Product Management
In today’s manufacturing landscape, staying competitive means moving beyond reactive quality checks and toward real-time, data-driven process control. But what does true manufacturing process optimization look like—and why is it more urgent now than ever? Join Jason Chester in this new, thought-provoking session on how modern manufacturers are rethinking quality operations from the ground up.
1. Introduction 2. Business requirements: dashboards and analytics 3. What is a data warehouse 4. OLTP vs OLAP based data warehouses 5. Conclusion 6. Further reading 7. References 1. Introduction If you are a student, analyst, engineer, or anyone in the data space, it’s important to understand what a data warehouse is. If you are wondering, “What is a data warehouse?”
October sees the launch of Partner Appreciation Month, and over the next few weeks we will be sharing success stories, updates, and interviews with our valued partners across the world. We’re on a mission to make data and analytics easy and accessible, for everyone, and the hybrid data cloud is how we’ll get there. Today’s world is a hybrid world—there’s hybrid data, hybrid infrastructure, hybrid work—and leading businesses are embracing these changes, unafraid to transform their processes and
Summary The key to making data valuable to business users is the ability to calculate meaningful metrics and explore them along useful dimensions. Business intelligence tools have provided this capability for years, but they don’t offer a means of exposing those metrics to other systems. Metriql is an open source project that provides a headless BI system where you can define your metrics and share them with all of your other processes.
An interdisciplinary team from Volkswagen, AWS, and Teradata has created an intelligent solution that enables greater transparency and efficiency in car body construction. Find out more.
Cloudera Data Platform (CDP) has supported access controls on tables and columns, as well as on files and directories, via Apache Ranger since its first release. It is common to have different workloads using the same data – some require authorizations at the table level (Apache Hive queries) and others at the underlying files (Apache Spark jobs). Unfortunately, in such instances you would have to create and maintain separate Ranger policies for both Hive and HDFS that correspond to each othe
Summary Transactions are a necessary feature for ensuring that a set of actions are all performed as a single unit of work. In streaming systems this is necessary to ensure that a set of messages or transformations are all executed together across different queues. In this episode Denis Rystsov explains how he added support for transactions to the Redpanda streaming engine.
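The all-or-nothing property described here can be illustrated with a toy sketch in pure Python — this is an illustration of the semantics only, not Redpanda’s actual API: messages destined for several topics are buffered and become visible together only when the transaction commits.

```python
from collections import defaultdict

class ToyTransactionalLog:
    """Toy illustration of transactional writes across topics (not Redpanda's API)."""

    def __init__(self):
        self.topics = defaultdict(list)    # committed, visible messages
        self._pending = defaultdict(list)  # buffered until commit

    def produce(self, topic, message):
        self._pending[topic].append(message)

    def commit(self):
        # Make every buffered message visible atomically.
        for topic, msgs in self._pending.items():
            self.topics[topic].extend(msgs)
        self._pending.clear()

    def abort(self):
        # Discard everything: none of the buffered messages ever become visible.
        self._pending.clear()

log = ToyTransactionalLog()
log.produce("orders", "o1")
log.produce("payments", "p1")
log.abort()                    # neither message is visible
log.produce("orders", "o2")
log.produce("payments", "p2")
log.commit()                   # both become visible together
print(log.topics["orders"], log.topics["payments"])  # ['o2'] ['p2']
```

A real streaming engine must additionally make the commit durable and coordinate it across brokers, which is the hard part the episode covers.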
ETL and ELT are some of the most common data engineering use cases, but can come with challenges like scaling, connectivity to other systems, and dynamically adapting to changing data sources. Airflow is specifically designed for moving and transforming data in ETL/ELT pipelines, and new features in Airflow 3.0 like assets, backfills, and event-driven scheduling make orchestrating ETL/ELT pipelines easier than ever!
By Minal Mishra The quality of a client application is of paramount importance to global digital products, as it is the primary way customers interact with a brand. At Netflix, we have significant investments in ensuring new versions of our applications are well tested. However, Netflix is available for streaming on thousands of device types and is powered by hundreds of microservices that are deployed independently, making it extremely challenging to comprehensively test internally.
Introduction. Apache Impala is a massively parallel in-memory SQL engine, supported by Cloudera, designed for analytics and ad hoc queries against data stored in Apache Hive, Apache HBase, and Apache Kudu tables. Supporting powerful queries and high levels of concurrency, Impala can use significant amounts of cluster resources. In multi-tenant environments this can inadvertently impact adjacent services such as YARN, HBase, and even HDFS.
Summary Aerospike is a database engine that is designed to provide millisecond response times for queries across terabytes or petabytes. In this episode Chief Strategy Officer, Lenley Hensarling, explains how the ability to process these large volumes of information in real-time allows businesses to unlock entirely new capabilities. He also discusses the technical implementation that allows for such extreme performance and how the data model contributes to the scalability of the system.
Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.
In this interview series we’ll share some of the stories that Daniel and I get to watch unfold at Pipeline Academy. Check out what our graduates have to say about the course, how they’ve tackled its challenges and what they are doing now with their new data engineering superpowers. Peter: Michele, it's great to see you again. Thanks for taking the time to have a chat with me.
If your organization is using multi-tenant big data clusters (and everyone should be), do you know the usage and cost efficiency of resources in the cluster by tenants? A chargeback or showback model allows IT to determine costs and resource usage by the actual analytic users in the multi-tenant cluster, instead of attributing those to the platform (“overhead”) or the IT department.
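At its simplest, a showback calculation allocates the cluster’s cost in proportion to each tenant’s measured resource usage. A minimal sketch, with invented cost and usage figures:

```python
# Hypothetical monthly showback: allocate cluster cost by share of CPU-hours used.
cluster_monthly_cost = 10_000.0  # invented figure
cpu_hours_by_tenant = {"marketing": 300.0, "finance": 500.0, "data-science": 1200.0}

total_hours = sum(cpu_hours_by_tenant.values())
showback = {
    tenant: round(cluster_monthly_cost * hours / total_hours, 2)
    for tenant, hours in cpu_hours_by_tenant.items()
}
print(showback)  # each tenant's proportional share of the bill
```

Real chargeback models also weight memory, storage, and I/O, but the proportional-allocation idea is the same.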
Retail is one of the first industries to leverage the power of machine learning and artificial intelligence. There are machine learning projects for almost every retail use case, from inventory management to customer satisfaction. Machine learning projects in retail directly convert into profits and increase an organization’s market share through better customer acquisition and satisfaction.
Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage
There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.
A DataOps Engineer owns the assembly line that’s used to build a data and analytics product. Data operations (or data production) is a series of pipeline procedures that take raw data through a series of processing and transformation steps and output finished products in the form of dashboards, predictions, data warehouses, or whatever the business requires.
To drive deeper business insights and greater revenues, organizations — whether big or small — need quality data. But more often than not, data is scattered across a myriad of disparate platforms, databases, and file systems. What’s more, that data comes in different forms, and its volume keeps growing rapidly every day — hence the name Big Data.
I first “joined” Monte Carlo exactly a year ago, as a data science intern. I met Lior, our co-founder, on Zoom in August of 2020. I had cast a volley of solicitous emails into my network — “(sort of) Stanford C.S. student looking to be (sort of) hired and avoid school for a while” — and one opportunity had come back from Oren and Glenn, former colleagues and now mentors of mine at GGV.
Microsoft Power BI is the most used business intelligence tool according to peer ratings on Gartner. Companies like Adobe, Heathrow, and PharmD have shown their trust in Power BI and continue to do so year after year. For 14 consecutive years, Microsoft has been named a Leader in Gartner’s Magic Quadrant for analytics and business intelligence platforms. As Power BI customers keep increasing, companies look for Power BI developers and analysts who can drive their business analytics and intelligence tasks.
Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives
Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri
The objective of this blog We’ve learned that 17 minutes a month is all it takes to ensure alignment between our development team and your goals for BI. Let’s break that down: 1 Power BI Adoption meeting (15 minutes) + a glance at our weekly Smartsheet report (30 seconds × 4 ≈ 2 minutes) = a total of 17 minutes per month. Effective communication is a fundamental element of success in the world of business intelligence.
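The arithmetic behind the 17-minute figure checks out:

```python
# 1 adoption meeting (15 min) + 4 weekly report glances (30 seconds each)
minutes = 15 + (30 / 60) * 4
print(minutes)  # 17.0 minutes per month
```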
In this post, we’ll discuss the difference between ETL vs ELT and when you might choose ETL or ELT. We’ll also include a flowchart to help walk you through the ETL vs ELT decision-making process. The difference between ETL vs ELT What’s the difference between ETL and ELT? The short answer is it’s all about […] The post ETL vs ELT flowchart: When to use each appeared first on A Cloud Guru.
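The post’s actual flowchart isn’t reproduced in this excerpt, but a common simplification of the ETL-vs-ELT decision can be sketched as a function. The two criteria below are illustrative assumptions of mine, not the article’s: whether sensitive data must be transformed (e.g. masked) before it lands, and whether the destination warehouse can scale the transformations itself.

```python
def choose_pattern(warehouse_can_scale_transforms: bool,
                   must_transform_before_load: bool) -> str:
    """Illustrative ETL-vs-ELT decision; criteria are assumptions, not the article's."""
    if must_transform_before_load:
        return "ETL"  # sensitive data must be cleaned/masked before it lands
    if warehouse_can_scale_transforms:
        return "ELT"  # cheap warehouse compute favors transform-after-load
    return "ETL"      # otherwise transform in a dedicated engine first

print(choose_pattern(True, False))  # ELT
print(choose_pattern(True, True))   # ETL
```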
Last month, we decided that we should all read a book and talk about it as a company. It was a fun experience and I think we made a good choice by picking 97 Things Every Data Engineer Should Know. This was the first book I have read in this series and I liked the format. It is made up of 97 small vignettes that are 2-3 pages each. This provided a nice overview of the breadth of topics that are relevant to data engineering including data warehouses/lakes, pipelines, metadata, security, complianc
With 67 zones, 140 edge locations, over 90 services, and 940,163 organizations using GCP across 200 countries, GCP is steadily garnering the attention of cloud users in the market. Flexera’s State of the Cloud report highlighted that 41% of survey respondents showed the most interest in using Google Cloud Platform for their future cloud computing projects.
In most countries, students start learning in September. As data engineers, let’s follow their lead and learn something new, too! I’m Pasha Finkelshteyn, and I’ll be your guide through this month’s news. I’ll offer my impressions of developments and highlight ideas from the wider community. If you think I missed something worthwhile, ping me on Twitter and suggest a topic, link, or anything else.
Last week, Matt Turck and John Wu published the latest annual report on the state of data, the 2021 Machine Learning, AI and Data (MAD) Landscape. If you haven’t read it yet, we recommend it as a comprehensive snapshot of the intricate world of AI, machine learning, and data science & engineering. Our team enjoyed reading it. We represent several of the pixels on this chart (hey, cool!
This blog is a step-by-step guide for a beginner in NLP. If you want to know the best way to learn NLP from scratch, please read this blog to the end. We are confident you will build the skills and confidence to make a career transition into data science as an NLP Engineer. We will begin with the essential subjects you must know to prepare for diving into the world of Natural Language Processing.
In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples for debugging Airflow DAGs. You’ll learn how to: Create a standardized process for debugging to quickly diagnose errors in your DAGs Identify common issues with DAGs, tasks, and connections Distinguish between Airflow-relate