October, 2020

How to submit Spark jobs to EMR cluster from Airflow

Start Data Engineering

Table of Contents: Introduction; Design; Setup; Prerequisites; Clone repository; Get data; Code; Move data and script to the cloud; Create an EMR cluster; Add steps and wait to complete; Terminate EMR cluster; Run the DAG; Conclusion; Further reading. Introduction: I have been asked, and have seen others ask, how people are automating Apache Spark jobs on EMR and how to submit Spark jobs to an EMR cluster from Airflow.
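As a rough illustration of the "add steps" stage listed in that article's outline, here is a minimal sketch of what an EMR step definition for a Spark job might look like. It only builds the dictionary in the shape that EMR's AddJobFlowSteps API expects (the same shape that Airflow's EMR add-steps operator takes in its `steps` argument); it does not call AWS. The function name `build_spark_step`, the bucket, and the script path are hypothetical and are not taken from the article.

```python
from typing import Any, Dict, List, Optional


def build_spark_step(
    name: str,
    script_s3_uri: str,
    script_args: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build one EMR step that runs spark-submit on a script stored in S3.

    Hypothetical helper for illustration; the returned dict follows the
    step structure used by EMR's AddJobFlowSteps API.
    """
    return {
        "Name": name,
        # Keep the cluster alive if the step fails, so it can be inspected.
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            # command-runner.jar lets an EMR step run a command such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "cluster",
                script_s3_uri,
                *(script_args or []),
            ],
        },
    }


# Assumed example values; replace with a real bucket and script.
step = build_spark_step(
    "demo_spark_job",
    "s3://example-bucket/scripts/job.py",
    ["--input", "s3://example-bucket/in"],
)
print(step["HadoopJarStep"]["Args"])
```

A list of such dictionaries would then be handed to whatever submits the steps, after the cluster is created and before it is terminated.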

Top 5 Things Every Kafka Developer Should Know

Confluent

Apache Kafka® is an event streaming platform used by more than 30% of the Fortune 500 today. There are numerous features of Kafka that make it the de-facto standard for […].

UK Government: From cloud first to cloud appropriate?

Cloudera

Since 2013, the UK Government’s flagship ‘Cloud First’ policy has been at the forefront of enabling departments to shed their legacy IT architecture in order to meaningfully embrace digital transformation. The policy outlines that the cloud (and specifically, public cloud) be the default position for any new services, unless it can be demonstrated that other alternatives offer better value for money.

Cloud Native Data Security As Code With Cyral

Data Engineering Podcast

Summary One of the most challenging aspects of building a data platform has nothing to do with pipelines and transformations. If you are putting your workflows into production, then you need to consider how you are going to implement data security, including access controls and auditing. Different databases and storage systems all have their own method of restricting access, and they are not all compatible with each other.

Driving Responsible Innovation: How to Navigate AI Governance & Data Privacy

Speaker: Aindra Misra, Senior Manager, Product Management (Data, ML, and Cloud Infrastructure) at BILL

Join us for an insightful webinar that explores the critical intersection of data privacy and AI governance. In today’s rapidly evolving tech landscape, building robust governance frameworks is essential to fostering innovation while staying compliant with regulations. Our expert speaker, Aindra Misra, will guide you through best practices for ensuring data protection while leveraging AI capabilities.

Data: The Crumbling Foundation of Finance, Our Once Trusted Advisor

Teradata

The most frequently asked question of Finance departments today is, ‘whose data do we trust’? Here’s how to ensure Finance always has the correct answer.

Building Netflix’s Distributed Tracing Infrastructure

Netflix Tech

By Maulik Pandey. Our team: Kevin Lew, Narayanan Arunachalam, Elizabeth Carretto, Dustin Haffner, Andrei Ushakov, Seth Katz, Greg Burrell, Ram Vaithilingam, Mike Smith, and Maulik Pandey. “@Netflixhelps Why doesn’t Tiger King play on my phone?” (a Netflix member via Twitter). This is an example of a question our on-call engineers need to answer to help resolve a member issue.

More Trending

Intrusion Detection with ksqlDB

Confluent

Apache Kafka® is a distributed real-time processing platform that allows for the ingestion of huge volumes of data. ksqlDB is part of the Kafka ecosystem and offers a SQL-like language […].

Data security vs usability: you can have it all

Cloudera

Growing up, were you ever told you can’t have it all? That you can’t eat all the snacks in one sitting? That you can’t watch the complete Back to the Future trilogy as well as study for your science exam in one evening? Over time, we learn to set priorities, make a decision for one thing over the other, and compromise. Just like when it comes to data access in business.

Better Data Quality Through Observability With Monte Carlo

Data Engineering Podcast

Summary In order for analytics and machine learning projects to be useful, they require a high degree of data quality. To ensure that your pipelines are healthy you need a way to make them observable. In this episode Barr Moses and Lior Gavish, co-founders of Monte Carlo, share the leading causes of what they refer to as data downtime and how it manifests.

Watch Out for Gotchas in Cloud Data Warehouse Pricing

Teradata

Successful companies need to squeeze maximum value from all of their data & do it at the lowest possible cost. But they often get hit with unexpected budget overruns. Teradata can help.

Launching LLM-Based Products: From Concept to Cash in 90 Days

Speaker: Christophe Louvion, Chief Product & Technology Officer of NRC Health and Tony Karrer, CTO at Aggregage

Christophe Louvion, Chief Product & Technology Officer of NRC Health, is here to take us through how he guided his company's recent experience of getting from concept to launch and sales of products within 90 days. In this exclusive webinar, Christophe will cover key aspects of his journey, including: LLM Development & Quick Wins 🤖 Understand how LLMs differ from traditional software, identifying opportunities for rapid development and deployment.

The Curse of Dimensionality

Domino Data Lab: Data Engineering

Danger of Big Data: Big data is all the rage. This could be lots of rows (samples) and few columns (variables), like credit card transaction data, or lots of columns (variables) and few rows (samples), like genomic sequencing in life sciences research. The Curse of Dimensionality, or Large P, Small N (P >> N) problem, applies to the latter case of many variables measured on relatively few samples.

Fullstack Typescript - create an API

Grouparoo

Two of the major components of the @grouparoo/core application are a Node.js API server and a React frontend. We use Actionhero as the API server, and Next.js for our React site generator. As we develop the Grouparoo application, we are constantly adding new API endpoints and changing existing ones. One of the great features of TypeScript is that it can help share type definitions not only within a codebase, but also across multiple codebases or services.

Introducing Confluent Platform 6.0

Confluent

Each month, we’ve announced a set of Confluent features organized around what we think are the key foundational traits of cloud-native data systems as part of Project Metamorphosis. Data systems […].

What you need to know to begin your journey to CDP

Cloudera

Recently, my colleague published a blog post, "Build on your investment by Migrating or Upgrading to CDP Data Center", which articulates great CDP Private Cloud Base features. Existing CDH and HDP customers can immediately benefit from this new functionality. This blog focuses on the process to accelerate your journey to CDP Private Cloud Base for both professional services engagements and self-service upgrades.

How To Speak The Language Of Financial Success In Product Management

Speaker: Jamie Bernard

Success in product management goes beyond delivering great features - it’s about achieving measurable financial outcomes that resonate across the organization. By connecting your product’s journey with the company’s financial success, you’ll ensure that every feature, release, and innovation contributes to the bottom line, driving both customer satisfaction and business growth.

Rapid Delivery Of Business Intelligence Using Power BI

Data Engineering Podcast

Summary Business intelligence efforts are only as useful as the outcomes that they inform. Power BI aims to reduce the time and effort required to go from information to action by providing an interface that encourages rapid iteration. In this episode Rob Collie shares his enthusiasm for the Power BI platform and how it stands out from other options.

How to Prioritize "Self" in Today's World: A Summary on Mental Health

Teradata

In honor of World Mental Health Day this past weekend, Shehzeen Rehman writes on the importance of de-stigmatizing mental health and learning how to seek help.

Why I Am Joining Rockset

Rockset

I’m excited to soon be the newest member of Rockset. I will be joining a truly spectacular engineering team, working on a product that leverages deep technical insights to make real-time analytics easy. My passion is building infrastructure that makes things simpler for users, supporting people at higher levels of the stack by giving them clean APIs and predictable behavior.

Build a Slack Dashboard (Part 2): Loading Into Postgres & Creating Basic Charts

Preset

Build a beautiful Slack dashboard using open source tools Meltano and Superset. Part 2 of 3.

What Is Entity Resolution? How It Works & Why It Matters

Entity resolution, sometimes referred to as data matching or fuzzy matching, is critical for data quality, analytics, graph visualization, and AI. Learn what entity resolution is, why it matters, how it works, and its benefits. Advanced entity resolution using AI is crucial because it efficiently and easily solves many of today’s data quality and analytics problems.

Preparing Your Clients and Tools for KIP-500: ZooKeeper Removal from Apache Kafka

Confluent

As described in the blog post Apache Kafka® Needs No Keeper: Removing the Apache ZooKeeper Dependency, when KIP-500 lands next year, Apache Kafka will replace its usage of Apache ZooKeeper […].

Women at work: a continuous state of evolution

Cloudera

Being a woman in tech can be incredibly rewarding, lonely, frustrating and inspiring all at once. Each individual has their own experience and path that they’ve followed to get where they are. That is, after all, what makes us unique. Earlier this week Cindy Maike, VP Industry Solutions, hosted a panel discussion with women across the Cloudera EMEA business working in a variety of different roles, each of us with diverse backgrounds and perspectives, which made for a wide-ranging discussion.

Self Service Real Time Data Integration Without The Headaches With Meroxa

Data Engineering Podcast

Summary Analytical workloads require a well engineered and well maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data warehouses is a non-trivial effort, requiring a substantial investment of time and energy. Meroxa is a new platform that aims to automate the heavy lifting of change data capture, monitoring, and data loading.

Announcing Vantage on Google Cloud

Teradata

Teradata Vantage on Google Cloud is now generally available! Vantage on Google Cloud is an as-a-service offer in which customers can get the most analytic value from their data. Read more.

Provide Real Value in Your Applications with Data and Analytics

The complexity of financial data, the need for real-time insight, and the demand for user-friendly visualizations can seem daunting when it comes to analytics - but there is an easier way. With Logi Symphony, we aim to turn these challenges into opportunities. Our platform empowers you to seamlessly integrate advanced data analytics, generative AI, data visualization, and pixel-perfect reporting into your applications, transforming raw data into actionable insights.

Rockset Raises $40M Series B to Empower Developers Building Real-Time Analytics

Rockset

Today, Rockset announced $40M in Series B funding from Sequoia and Greylock, our two investors who have partnered with us right from the beginning. Additionally, we announced support for fully managed, secure private deployments of Rockset within a customer’s Amazon VPC. These are important milestones for both our company and product, but this announcement is less a celebration of Rockset than a recognition of our hundreds of beloved customers who have launched amazing real-time applications.

Build A StackOverflow Dashboard (Part 2): Crafting BigQuery Views and Superset Charts

Preset

In part 2, we'll start to visualize trends using Superset charts.

How Real-Time Materialized Views Work with ksqlDB, Animated

Confluent

All around the world, companies are asking the same question: What is happening right now? We are inundated with pieces of data that have a fragment of the answer. But […].

7 New Ways Cloudera Is Investing in Our Culture

Cloudera

As Cloudera offices around the world continue to cope with the impact of COVID-19, we have worked hard to ease stress and adapt to remote working. People are the heart of our company and we’re investing in creative, new ways to make every Clouderan feel valued and appreciated. Clouderans are superstars at work and at home, and burn-out is unhealthy for employees, their families, and the company.

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Speaker: Timothy Chan, PhD., Head of Data Science

Are you ready to move beyond the basics and take a deep dive into the cutting-edge techniques that are reshaping the landscape of experimentation? 🌐 From Sequential Testing to Multi-Armed Bandits, Switchback Experiments to Stratified Sampling, Timothy Chan, Data Science Lead, is here to unravel the mysteries of these powerful methodologies that are revolutionizing how we approach testing.

Netflix Android and iOS Studio Apps - now powered by Kotlin Multiplatform

Netflix Tech

By David Henry & Mel Yahya. Over the last few years Netflix has been developing a mobile app called Prodicle to innovate in the physical production of TV shows and movies. The world of physical production is fast-paced, and needs vary significantly by country, by region, and even from one production to the next.

Survey: Enterprise Data More Important Than Ever Since Onset of COVID-19

Teradata

Our new global survey reveals how business leaders are changing the way they think about data, from their trust in it to the role it plays in a post-pandemic recovery.

3 Tools to Help Debug Slow Queries in MongoDB

Rockset

Regardless of what database you pick to run your application—MongoDB, Postgres, Oracle, or Cassandra—you will eventually encounter the same issue: slow queries. Slow queries can be the result of inefficient query design, inefficient table design, or general infrastructure problems. Although it may be tempting to add more machines or further complicate your data infrastructure to speed up your queries, improving the queries themselves is usually the best place to start when you want to improve da

The Superset REST API

Preset

A high level tour of Apache Superset's REST API

The AI Superhero Approach to Product Management

Speaker: Conrado Morlan

In this engaging and witty talk, industry expert Conrado Morlan will explore how artificial intelligence can transform the daily tasks of product managers into streamlined, efficient processes. Using the lens of a superhero narrative, he’ll uncover how AI can be the ultimate sidekick, aiding in data management and reporting, enhancing productivity, and boosting innovation.