June, 2022

Data Orchestration Trends: The Shift From Data Pipelines to Data Products

Simon Späti

Data consumers, such as data analysts and business users, care mostly about the production of data assets. Data engineers, on the other hand, have historically focused on modeling the dependencies between tasks (instead of data assets) with an orchestrator tool. How can we reconcile both worlds? This article reviews open-source data orchestration tools (Airflow, Prefect, Dagster) and discusses how data orchestration tools introduce data assets as first-class objects.

Azure Data Factory: New Monitoring View Features

Azure Data Engineering

It is very easy to visually monitor previous pipeline runs in Data Factory using the Monitor page in Azure Data Factory, which we have already covered in a previous post. There have been some recent improvements to the monitoring view, and we will go through these briefly in this post. Data from the Azure Monitor view can be easily exported to CSV by clicking the newly added Export to CSV button.

Bring Geospatial Analytics Across Disparate Datasets Into Your Toolkit With The Unfolded Platform

Data Engineering Podcast

Summary: The proliferation of sensors and GPS devices has dramatically increased the number of applications for spatial data, and the need for scalable geospatial analytics. In order to reduce the friction involved in aggregating disparate data sets that share geographic similarities, the Unfolded team built a platform that supports working across raster, vector, and tabular data in a single system.

5 Steps to land a high paying data engineering job

Start Data Engineering

Outline: 1. Introduction. 2. Steps: 2.1. Choosing companies to work for; 2.2. Optimizing your LinkedIn & resume; 2.3. Landing interviews; 2.4. Preparing for interviews; 2.5. Offers & Negotiation. 3. Conclusion. 4. Further reading. 5. Reference.

Introduction: The data industry is booming, and data engineering salaries are skyrocketing. But landing a new job is not an easy task.

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Dynamic Task Mapping in Apache Airflow

Marc Lamberti

Dynamic Task Mapping is a new feature of Apache Airflow 2.3 that takes your DAGs to a new level. Now, you can create tasks dynamically without knowing in advance how many tasks you need. This feature is for you if you want to process various files, evaluate multiple machine learning models, or process a varying amount of data based on a SQL query. Excited?

An In-Depth Data Mesh Discussion with Zhamak Dehghani

Jesse Anderson

In 2021 I first had the pleasure of getting to know and speaking with Zhamak Dehghani, Director of Emerging Technologies at ThoughtWorks, in season one of the Data Dream Team series. Zhamak is a software engineer and architect who is (in)famously known as the founder of the data mesh concept, a paradigm shift in how we manage data-driven value at scale. I interviewed Zhamak last season as more of an introduction to Data Mesh.

More Trending

Azure Data Factory: Script Activity

Azure Data Engineering

We have discussed various ways of running custom SQL code in Azure Data Factory in a previous post. Recently, a new activity called Script Activity has been added to Azure Data Factory, which provides a more flexible way of running custom SQL scripts. (Screenshot: Azure Data Factory: Script Activity.) As shown in the screenshot above, this activity supports execution of custom Data Query Language (DQL) as well as Data Definition Language (DDL) and Data Manipulation Language (DML).

Level Up Your Data Platform With Active Metadata

Data Engineering Podcast

Summary: Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. A variety of platforms have been developed to capture and analyze that information to great effect, but they are inherently limited in their utility due to their nature as storage systems. In order to level up their value, a new trend of active metadata is being implemented, allowing use cases like keeping BI reports up to date, auto-scaling your warehouses, and automated data g

24 SQL Questions You Might See on Your Next Interview

KDnuggets

Preparing for the SQL job interview can be overwhelming enough. You don’t need someone telling you that you need to know everything on top of that! Be smart and focus on preparing the SQL questions that appear most often at the job interview.

The Future Is Hybrid Data, Embrace It

Cloudera

We live in a hybrid data world. In the past decade, the amount of structured data created, captured, copied, and consumed globally has grown from less than 1 ZB in 2011 to nearly 14 ZB in 2020. Impressive, but dwarfed by the amount of unstructured data, cloud data, and machine data – another 50 ZB. In fact, the total amount of data is expected to nearly triple by 2025.

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, and Terrence Sheflin

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Natively Connect Teradata QueryGrid to Google BigQuery

Teradata

With the Teradata QueryGrid Google BigQuery Connector, we’re enabling our customers to natively join data between Vantage and BigQuery in real-time, at scale.

Machine Learning Metrics: How to Measure the Performance of a Machine Learning Model

AltexSoft

Choosing the machine learning path when developing your software is half the success. Yes, it’s an advanced way of doing things. Yes, it brings automation, so widely discussed machine intelligence, and other awesome perks. But just because you put it there doesn’t guarantee your project will do well and pay off. So, how would you measure the success of a machine learning model?

Azure Data Factory: Monitor Self Hosted Integration Runtime Metrics

Azure Data Engineering

A self-hosted integration runtime, in the context of Azure Data Factory, is a gateway that connects on-prem data sources to datastores in the cloud. To know more about integration runtimes, please refer to the previous post. We have discussed how to check whether an integration runtime is online or offline using a PowerShell command in a previous post. In today’s post, let’s have a look at how to monitor self-hosted integration runtime metrics such as CPU utilization, available memory, number of concu

Combining The Simplicity Of Spreadsheets With The Power Of Modern Data Infrastructure At Canvas

Data Engineering Podcast

Summary: Data analysis is a valuable exercise that is often out of reach of non-technical users as a result of the complexity of data systems. In order to lower the barrier to entry, Ryan Buick created the Canvas application with a spreadsheet-oriented workflow that is understandable to a wide audience. In this episode Ryan explains how he and his team have designed their platform to bring everyone onto a level playing field and the benefits that it provides to the organization.

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

Primary Supervised Learning Algorithms Used in Machine Learning

KDnuggets

In this tutorial, we are going to list some of the most common algorithms that are used in supervised learning along with a practical tutorial on such algorithms.

Moving Enterprise Data From Anywhere to Any System Made Easy

Cloudera

Since 2015, the Cloudera DataFlow team has been helping the largest enterprise organizations in the world adopt Apache NiFi as their enterprise standard data movement tool. Over the last few years, we have had a front-row seat in our customers’ hybrid cloud journey as they expand their data estate across the edge, on-premise, and multiple cloud providers.

Modernizing a public health system with Teradata’s connected analytic architecture

Teradata

How do you accelerate disease prevention and response? Teradata provides a response to help accelerate public health infrastructure modernization.

Autonomous Networks — The Telco and Media Growth Engine

Confluent

How real-time integrations between modern and legacy systems benefit communication service providers with autonomous network features, enhanced customer experiences, and more.

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus and as the complexity of the task increases: the number of use cases and corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.

How Netflix Content Engineering makes a federated graph searchable (Part 2)

Netflix Tech

By Alex Hutter, Falguni Jhaveri, and Senthil Sayeebaba. In a previous post, we described the indexing architecture of Studio Search and how we scaled the architecture by building a config-driven self-service platform that allowed teams in Content Engineering to spin up search indices easily. This post will discuss how Studio Search supports querying the data available in these indices.

Discover And De-Clutter Your Unstructured Data With Aparavi

Data Engineering Podcast

Summary: Unstructured data takes many forms in an organization. From a data engineering perspective that often means things like JSON files, audio or video recordings, images, etc. Another category of unstructured data that every business deals with is PDFs, Word documents, workstation backups, and countless other types of information. Aparavi was created to tame the sprawl of information across machines, datacenters, and clouds so that you can reduce the amount of duplicate data and save time an

Generate Synthetic Time-series Data with Open-source Tools

KDnuggets

An introduction to the generative adversarial network model DoppelGANger, and how you can use a new open-source PyTorch implementation of it to create high-quality synthetic time-series data.

Cloudera’s Applied ML Prototype Catalog Continues to Grow

Cloudera

Here at Cloudera, we’re committed to helping make the lives of data practitioners as painless as possible. For data scientists, we continue to provide new Applied Machine Learning Prototypes (AMPs), which are open source and available on GitHub. These pre-built reference examples are complete end-to-end data science projects. In Cloudera Machine Learning (CML), you can deploy them with the single click of a button, bringing data scientists that much closer to providing value.

The Ultimate Guide To Data-Driven Construction: Optimize Projects, Reduce Risks, & Boost Innovation

Speaker: Donna Laquidara-Carr, PhD, LEED AP, Industry Insights Research Director at Dodge Construction Network

In today’s construction market, owners, construction managers, and contractors must navigate increasing challenges, from cost management to project delays. Fortunately, digital tools now offer valuable insights to help mitigate these risks. However, the sheer volume of tools and the complexity of leveraging their data effectively can be daunting. That’s where data-driven construction comes in.

A Model Implementation

Teradata

How do you take the first steps to free the power of analytics from on-premise systems whilst protecting valuable data and de-risking transformation? Find out more.

Introducing the Current 2022 Program Committee

Confluent

The committee will ensure Current has the best speakers from top companies in every industry, and cover all streaming data technologies.

Scaling Appsec at Netflix (Part 2)

Netflix Tech

By Astha Singhal, Lakshmi Sudheer, Julia Knecht. The Application Security teams at Netflix are responsible for securing the software footprint that we create to run the Netflix product, the Netflix studio, and the business. Our customers are product and engineering teams at Netflix that build these software services and platforms. The Netflix cultural values of ‘Context not Control’ and ‘Freedom and Responsibility’ strongly influence how we do Security at Netflix.

Strategies And Tactics For A Successful Master Data Management Implementation

Data Engineering Podcast

Summary: The most complicated part of data engineering is the effort involved in making the raw data fit into the narrative of the business. Master Data Management (MDM) is the process of building consensus around what the information actually means in the context of the business and then shaping the data to match those semantics. In this episode Malcolm Hawker shares his years of experience working in this domain to explore the combination of technical and social skills that are necessary to mak

Driving Responsible Innovation: How to Navigate AI Governance & Data Privacy

Speaker: Aindra Misra, Senior Manager, Product Management (Data, ML, and Cloud Infrastructure) at BILL

Join us for an insightful webinar that explores the critical intersection of data privacy and AI governance. In today’s rapidly evolving tech landscape, building robust governance frameworks is essential to fostering innovation while staying compliant with regulations. Our expert speaker, Aindra Misra, will guide you through best practices for ensuring data protection while leveraging AI capabilities.

Learn MLOps with This Free Course

KDnuggets

Learn to train and track your experiments, create ML pipelines, deploy models, monitor their performance in production, and adopt best practices from DevOps.

#ClouderaLife Spotlight: Hassan Mirza

Cloudera

In this #ClouderaLife Spotlight, Hassan talks about three life themes that have kept him moving and motivated: learning from his father’s work ethic despite his family’s forcible displacement from their country of origin, his early experience with organized sports, and the value of mentorship. Hassan describes how these experiences led him to give back to his family and community by becoming a Mental Health First Aider and a mentor for refugees seeking a better life.

Operational excellence—data ensures airlines maintain the right trajectory

Teradata

Learn how data and analytics can enable airlines to navigate towards more streamlined operations. Read more.

How to Elastically Scale Apache Kafka Clusters on Confluent Cloud

Confluent

How to elastically scale Kafka clusters from 0 to 100 MB/s and back with automatic cluster resizing, data rebalancing, real-time consumption optimization, and monitoring in seconds.

What Is Entity Resolution? How It Works & Why It Matters

Sometimes referred to as data matching or fuzzy matching, entity resolution is critical for data quality, analytics, graph visualization, and AI. Learn what entity resolution is, why it matters, how it works, and its benefits. Advanced entity resolution using AI is crucial because it efficiently and easily solves many of today’s data quality and analytics problems.
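
As a rough illustration of the "fuzzy matching" idea behind entity resolution, here is a minimal sketch using only Python's standard-library `difflib`. The function names, the 0.85 threshold, and the greedy grouping strategy are assumptions made for this example; it is not the AI-based approach the article refers to, and real systems typically add blocking, multiple attributes, and tuned scoring.

```python
from difflib import SequenceMatcher
from typing import List


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def resolve_entities(records: List[str], threshold: float = 0.85) -> List[List[str]]:
    """Greedily group records that resemble the first member of an existing cluster.

    The threshold is a hypothetical value chosen for illustration only.
    """
    clusters: List[List[str]] = []
    for record in records:
        for cluster in clusters:
            if similarity(record, cluster[0]) >= threshold:
                cluster.append(record)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([record])
    return clusters


if __name__ == "__main__":
    names = ["Acme Corp.", "ACME corp", "Globex Inc", "acme  corp", "Initech"]
    for group in resolve_entities(names):
        print(group)
```

Comparing each record only against the first member of a cluster keeps the sketch short; it also makes the result depend on input order, which is one reason production systems use more careful clustering.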