March, 2020

How to process simple data stream and consume with Lambda

Team Data Science

I built a serverless architecture for my simulated credit card complaints stream using AWS S3, AWS Lambda, and AWS Kinesis. The picture above gives a high-level view of the data flow. I treat uploading a CSV file as the data producer: once you upload a file, it generates an object-created event and the Lambda function is invoked asynchronously. The file's content is written to the Kinesis stream as a record (record = data + partition key), which triggers another Lambda function and persist th
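The "record = data + partition key" idea in the teaser above can be sketched in plain Python. This is only an illustration, not the article's code: the function names and the sample columns (`complaint_id`, `product`) are hypothetical. The first function builds entries in the shape that the Kinesis PutRecords call expects (`Data` and `PartitionKey`). The second decodes the base64 payloads found in the event a Kinesis-triggered Lambda receives. No AWS calls are made.

```python
import base64
import csv
import io
import json


def build_kinesis_records(csv_text, key_column):
    """Turn CSV text into entries shaped like the Records list of Kinesis PutRecords."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        records.append({
            # Data is the payload as bytes; here each CSV row becomes one JSON document.
            "Data": json.dumps(row).encode("utf-8"),
            # The partition key determines which shard receives the record.
            "PartitionKey": str(row[key_column]),
        })
    return records


def decode_kinesis_event(event):
    """Decode the base64-encoded payloads in a Kinesis-triggered Lambda event."""
    return [
        json.loads(base64.b64decode(rec["kinesis"]["data"]).decode("utf-8"))
        for rec in event["Records"]
    ]
```

In a real pipeline the first Lambda would pass the list from `build_kinesis_records` to a Kinesis client, and the second Lambda would call something like `decode_kinesis_event` on its incoming event before persisting the rows.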

Scheduling a SQL script, using Apache Airflow, with an example

Start Data Engineering

One of the most common use cases for Apache Airflow is to run scheduled SQL scripts. Developers who start with Airflow often ask questions such as “How to use Airflow to orchestrate SQL?”

20+ Machine Learning Datasets & Project Ideas

KDnuggets

Upgrading your machine learning, AI, and Data Science skills requires practice. To practice, you need to develop models with a large amount of data. Finding good datasets to work with can be challenging, so this article discusses more than 20 great datasets along with machine learning project ideas for you to tackle today.

Why We Leverage Multi-tenancy in Uber’s Microservice Architecture

Uber Engineering

The performance of Uber’s services relies on our ability to quickly and stably launch new features on our platform, regardless of where the corresponding service lives in our tech stack. Foundational to our platform’s power is its microservice-based architecture …

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Ready for changes with Hexagonal Architecture

Netflix Tech

By Damir Svrtan and Sergii Makagon. As the production of Netflix Originals grows each year, so does our need to build apps that enable efficiency throughout the entire creative process. Our wider Studio Engineering Organization has built more than 30 apps that help content progress from pitch (aka screenplay) to playback: ranging from script content acquisition, deal negotiations and vendor management to scheduling, streamlining production workflows, and so on.

Advanced Analytics for Coronavirus – Trends, Patterns, Predictions

Teradata

Advanced analytics and AI can significantly accelerate data processing required to get the insights, answers and recommendations to handle and address the COVID-19 pandemic.

More Trending

The Life Of A Non-Profit Data Professional

Data Engineering Podcast

Summary: Building and maintaining a system that integrates and analyzes all of the data for your organization is a complex endeavor. Operating on a shoestring budget makes it even more challenging. In this episode Tyler Colby shares his experiences working as a data professional in the non-profit sector. From managing Salesforce data models to wrangling a multitude of data sources and compliance challenges, he describes the biggest challenges that he is facing.

Coronavirus Data and Poll Analysis – yes, there is hope, if we act now

KDnuggets

We examine the growth of coronavirus daily cases in most affected countries, and show evidence that social distancing works in reducing the rate of spread. We also analyze KDnuggets Poll results - the scale of change to online and how Data Science work is likely to increase or drop in different regions. Stay Healthy and practice social distancing!

10 Key skills, to help you become a data engineer

Start Data Engineering

This article gives you an overview of the 10 key skills you need to become a better data engineer. If you are struggling to get started on what to learn, start with the first topic and proceed through the list.

15 Things Every Apache Kafka Engineer Should Know About Confluent Replicator

Confluent

Single-cluster deployments of Apache Kafka® are rare. Most medium to large deployments employ more than one Kafka cluster, and even the smallest use cases include development, testing, and production clusters. […].

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, and Terrence Sheflin

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Improving Prediction of the Unconfirmed COVID-19 Cases

Teradata

With the lack of available tests & uncertainty around the true number of COVID-19 cases, Teradata Epidemiologist Daniel Ulatowski & Data Scientist Jack McCush hypothesize how symptomatic data & the Vantage ML Engine can be utilized to predict cases.

Introducing Dispatch

Netflix Tech

By Kevin Glisson, Marc Vilanova, and Forest Monsen. Netflix is pleased to announce the open-source release of our crisis management orchestration framework: Dispatch! Okay, but what is Dispatch? Put simply, Dispatch is: All of the ad-hoc things you’re doing to manage incidents today, done for you, and a bunch of other things you should’ve been doing, but have not had the time!

Behind The Scenes Of The Linode Object Storage Service

Data Engineering Podcast

Summary: There are a number of platforms available for object storage, including self-managed open source projects. But what goes on behind the scenes of the companies that run these systems at scale so you don’t have to? In this episode Will Smith shares the journey that he and his team at Linode recently completed to bring a fast and reliable S3-compatible object storage service to production for your benefit.

The 4 Best Jupyter Notebook Environments for Deep Learning

KDnuggets

Many cloud providers and other third-party services see the value of a Jupyter notebook environment, which is why many companies now offer cloud-hosted notebooks. Let's have a look at 3 such environments.

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

Learn to Optimize Algorithms in Our New Algorithm Complexity Course

Dataquest

Algorithms are at the center of almost any programming job. And particularly in the world of data engineering, using efficient algorithms is important enough that it’s a common topic to be quizzed about in job interviews. That’s why we’ve just launched a new course! Algorithm Complexity is the latest course in our Data Engineer career path.

Kafka Connect Elasticsearch Connector in Action

Confluent

The Elasticsearch sink connector helps you integrate Apache Kafka® and Elasticsearch with minimum effort. You can take data you’ve stored in Kafka and stream it into Elasticsearch to then be […].

People, We Need to Talk About Mass Electronic Surveillance

Teradata

With the COVID-19 epidemic in full swing, the countries that are faring the best are employing large-scale testing and electronic surveillance. But what does this mean for our civil liberties?

Open-Sourcing riskquant, a library for quantifying risk

Netflix Tech

Netflix has a program in our Information Security department for quantifying the risk of deliberate (attacker-driven) and accidental…

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.

Building A New Foundation For CouchDB

Data Engineering Podcast

Summary: CouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and an HTTP interface, it has become popular as a backend for web and mobile applications. Created 15 years ago, it has accrued some technical debt, which is being addressed with a refactored architecture based on FoundationDB.

What is the most effective policy response to the new coronavirus pandemic?

KDnuggets

Where Test/Trace/Quarantine is working, the number of cases per day has declined empirically. Furthermore, this appears to be a radically superior strategy where it can be deployed. I’ll review the evidence, discuss the other strategies and their consequences, and then discuss what can be done.

Query Lambdas: Increasing Developer Velocity for Application Development

Rockset

At Rockset we strive to make building modern data applications easy and intuitive. Data-backed applications come with an inherent amount of complexity - managing the database backend, exposing a data API (often using hard-coded SQL or an ORM to write queries), keeping the data and application code in sync; the list goes on. Just as Rockset has reimagined and dramatically simplified the traditional ETL pipeline on the data-loading side, we’re now proud to release a new product feature - Query La

Building a Cloud ETL Pipeline on Confluent Cloud

Confluent

As enterprises move more and more of their applications to the cloud, they are also moving their on-prem ETL (extract, transform, load) pipelines to the cloud, as well as building […].

The Ultimate Guide To Data-Driven Construction: Optimize Projects, Reduce Risks, & Boost Innovation

Speaker: Donna Laquidara-Carr, PhD, LEED AP, Industry Insights Research Director at Dodge Construction Network

In today’s construction market, owners, construction managers, and contractors must navigate increasing challenges, from cost management to project delays. Fortunately, digital tools now offer valuable insights to help mitigate these risks. However, the sheer volume of tools and the complexity of leveraging their data effectively can be daunting. That’s where data-driven construction comes in.

Five Books Every CX Leader Should Read in this Time of Social Distancing

Teradata

Check out this curated reading list of books on customer experience. From updated classics to new research and insights into how large enterprises can drive business outcomes from a CX initiative.

SVT-AV1: an open-source AV1 encoder and decoder

Netflix Tech

By Andrey Norkin, Joel Sole, Mariana Afonso, Kyle Swanson, Agata Opalach, Anush Moorthy, and Anne Aaron. SVT-AV1 is an open-source AV1 codec implementation hosted on GitHub [link] under a BSD + patent license. As mentioned in our earlier blog post, Intel and Netflix have been collaborating on the SVT-AV1 encoder and decoder framework since August 2018.

Scaling Data Governance For Global Businesses With A Data Hub Architecture

Data Engineering Podcast

Summary: Data governance is a complex endeavor, but scaling it to meet the needs of a complex or globally distributed organization requires a well-considered and coherent strategy. In this episode Tim Ward describes an architecture that he has used successfully with multiple organizations to scale compliance. Treating it as a graph problem, where each hub in the network has localized control and inherits higher-level controls, reduces overhead and provides greater flexibility.

When Will AutoML replace Data Scientists? Poll Results and Analysis

KDnuggets

Will AI always be 5-10 years away? The majority of respondents to this poll think that AutoML will reach expert level in 5-10 years. Interestingly, it is about the same as 5 years ago. We examine the trends by AutoML experience, industry, and region.

Driving Responsible Innovation: How to Navigate AI Governance & Data Privacy

Speaker: Aindra Misra, Senior Manager, Product Management (Data, ML, and Cloud Infrastructure) at BILL

Join us for an insightful webinar that explores the critical intersection of data privacy and AI governance. In today’s rapidly evolving tech landscape, building robust governance frameworks is essential to fostering innovation while staying compliant with regulations. Our expert speaker, Aindra Misra, will guide you through best practices for ensuring data protection while leveraging AI capabilities.

On Spark, Hive, and Small Files: An In-Depth Look at Spark Partitioning Strategies

Airbnb Tech

One of the most common ways to store results from a Spark job is by writing the results to a Hive table stored on HDFS. While in theory, managing the output file count from your jobs should be simple, in reality, it can be one of the more complex parts of your pipeline. Author: Zachary Ennenga. Background: At Airbnb, our offline data processing ecosystem contains many mission-critical, time-sensitive jobs; it is essential for us to maximize the stabilit

Sharpening your Stream Processing Skills with Kafka Tutorials

Confluent

In the Apache Kafka® ecosystem, ksqlDB and Kafka Streams are two popular tools for building event streaming applications that are tightly integrated with Apache Kafka. While ksqlDB and Kafka Streams […].

Saudi Telecom Company

Teradata

STC uses Teradata to serve each segment as one team, increasing response rates, customer satisfaction, and revenue as well as reducing operating and call center costs.

How Netflix uses Druid for Real-time Insights to Ensure a High-Quality Experience

Netflix Tech

By Ben Sykes

What Is Entity Resolution? How It Works & Why It Matters

Entity resolution, sometimes referred to as data matching or fuzzy matching, is critical for data quality, analytics, graph visualization, and AI. Learn what entity resolution is, why it matters, how it works, and its benefits. Advanced entity resolution using AI is crucial because it efficiently and easily solves many of today’s data quality and analytics problems.