July, 2019

article thumbnail

Ten more random useful things in R you may not know about

KDnuggets

I had a feeling that R has developed as a language to such a degree that many of us are using it now in completely different ways. This means that there are likely to be numerous tricks, packages, functions, etc that each of us use, but that others are completely unaware of, and would find useful if they knew about them.

IT 123
article thumbnail

Fault Tolerance in Distributed Systems: Tracing with Apache Kafka and Jaeger

Confluent

Using Jaeger tracing, I’ve been able to answer an important question that nearly every Apache Kafka ® project that I’ve worked on posed: how is data flowing through my distributed system? Quick disclaimer: if you’re simply looking for an answer to that question, this post won’t provide that answer directly. Instead, in this post I will point you to an earlier blog post where I already answered that question and then I will focus on what should be your next question: now that I’m relying on Jaege

Kafka 54
article thumbnail

Our Commitment to Open Source Software

Cloudera

Open source has been core to the missions of both Hortonworks and Cloudera and central to our values and culture. With more than 700 engineers in the new Cloudera, our company writes a prodigious amount of open source code each year that’s contributed to more than 30 different open source projects. We’re also a very innovative open source company, having collectively launched more than a dozen new open source projects since the founding of the two companies. .

article thumbnail

The Power of Integrated Data and Analytics

Teradata

Integrated data and analytics has a proven track record of helping organize operations, enhance customer experience and improve revenue and market growth.

Data 104
article thumbnail

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

article thumbnail

Simplifying Data Integration Through Eventual Connectivity

Data Engineering Podcast

Summary The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration.

article thumbnail

Evolution of Netflix Conductor:

Netflix Tech

v2.0 and beyond By Anoop Panicker and Kishore Banala Conductor is a workflow orchestration engine developed and open-sourced by Netflix. If you’re new to Conductor, this earlier blogpost and the documentation should help you get started and acclimatized to Conductor. Netflix Conductor: A microservices orchestrator In the last two years since inception, Conductor has seen wide adoption and is instrumental in running numerous core workflows at Netflix.

More Trending

article thumbnail

Getting started with the MongoDB Connector for Apache Kafka and MongoDB

Confluent

Together, MongoDB and Apache Kafka ® make up the heart of many modern data architectures today. Integrating Kafka with external systems like MongoDB is best done though the use of Kafka Connect. This API enables users to leverage ready-to-use components that can stream data from external systems into Kafka topics, as well as stream data from Kafka topics into external systems.

MongoDB 21
article thumbnail

Solving the Pain Points of Big Data Management

Cloudera

Every business aims to deliver products and services quickly and efficiently based upon customer wants and needs. Today, much of that speed and efficiency relies on insights driven by big data. Yet big data management often serves as a stumbling block, because many businesses continue to struggle with how to best capture and analyze their data. Unorganized data presents another roadblock.

article thumbnail

How Analytics Answer the Most Challenging Business Questions

Teradata

Analytics can help enterprises answer the toughest business questions by leveraging all of the data across an organization.

Data 80
article thumbnail

Straining Your Data Lake Through A Data Mesh

Data Engineering Podcast

Summary The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access. In this episode Zhamak Dehghani shares an alternative approach in the form of a data mesh.

Data Lake 100
article thumbnail

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

article thumbnail

Re-Architecting the Video Gatekeeper

Netflix Tech

By Drew Koszewnik This is the story about how the Content Setup Engineering team used Hollow, a Netflix OSS technology, to re-architect and simplify an essential component in our content pipeline?—?delivering a large amount of business value in the process. The Context Each movie and show on the Netflix service is carefully curated to ensure an optimal viewing experience.

article thumbnail

Fantastic Four of Data Science Project Preparation

KDnuggets

This article takes a closer look at the four fantastic things we should keep in mind when approaching every new data science project.

article thumbnail

KSQL in Football: FIFA Women’s World Cup Data Analysis

Confluent

One of the football (as per European terminology) highlights of the summer is the FIFA Women’s World Cup. France, Brazil, and the USA are the favourites, and this year Italy is present at the event for the first time in 20 years. From a data perspective, the World Cup represents an interesting source of information. There’s a lot of dedicated press coverage, as well as the standard social media excitement following any kind of big event.

article thumbnail

Educating Data Analysts at Scale: Cloudera Launches Modern Big Data Analysis with SQL on Coursera

Cloudera

At a time when machine learning, deep learning, and artificial intelligence capture an outsize share of media attention, jobs requiring SQL skills continue to vastly outnumber jobs requiring those more advanced skills. Influential data scientists often point to SQL as the most important yet underrated skill for anyone who works with data. SQL is today—and will remain for the foreseeable future—a vital foundational skill for a wide range of data professionals working in different roles across dif

article thumbnail

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

article thumbnail

How to Enjoy Hybrid Partitioning with Teradata Columnar

Teradata

Teradata Vantage's NewSQL Engine's performance-enhancing options include column-row hybrid partitioning. Find out how to take advantage of this great feature.

article thumbnail

Data Labeling That You Can Feel Good About With CloudFactory

Data Engineering Podcast

Summary Successful machine learning and artificial intelligence projects require large volumes of data that is properly labelled. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box.

article thumbnail

Bringing Rich Experiences to Memory-constrained TV Devices

Netflix Tech

Bringing Rich Experiences to Memory-Constrained TV Devices By Jason Munning, Archana Kumar, Kris Range Netflix has over 148M paid members streaming on more than half a billion devices spanning over 1,900 different types. In the TV space alone, there are hundreds of device types that run the Netflix app. We need to support the same rich Netflix experience on not only high-end devices like the PS4 but also memory and processor-constrained consumer electronic devices that run a similar chipset as w

article thumbnail

Convolutional Neural Networks: A Python Tutorial Using TensorFlow and Keras

KDnuggets

Different neural network architectures excel in different tasks. This particular article focuses on crafting convolutional neural networks in Python using TensorFlow and Keras.

Python 123
article thumbnail

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

article thumbnail

Bust the Burglars – Machine Learning with TensorFlow and Apache Kafka

Confluent

Have you ever realized that, according to the latest FBI report , more than 80% of all crimes are property crimes, such as burglaries? And that the FBI clearance figures indicate that only 13% of all burglaries in 2017 were cleared due to lack of witnesses and/or physical evidence? How cool would it be to build your own burglar alarm system that can alert you before the actual event takes place simply by using a few network-connected cameras and analyzing the camera images with Apache Kafka ® ,

article thumbnail

Crafting the Perfect Internship Playlist

Pandora Engineering

Credit: Kanok Sulaiman Disclaimer: These are my experiences from being a Pandora software developer intern in the summer of 2019. All opinions expressed are my own, and represent no one except myself. I recently spent the last summer of my undergraduate program as an intern for Pandora Media in Oakland, CA. I gained a lot from my experience, and I’m writing this post to detail the application process, the lessons that I learned, and the company culture.

Java 52
article thumbnail

Enterprise Data Strategy: The Upside of Scarce Funding

Teradata

In a cost-cutting culture, directly linking data projects to top business initiatives is a good way to keep them from getting clipped. Learn more.

Data 73
article thumbnail

Scale Your Analytics On The Clickhouse Data Warehouse

Data Engineering Podcast

Summary The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and Alexander Zaitsev explain how it is architected to provide these features, the various unique capabilities that it provides, and how to run it in production.

article thumbnail

Prepare Now: 2025s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

article thumbnail

Introduction to Streaming Data

Cloud Academy

Designing a streaming data pipeline presents many challenges, particularly around specific technology requirements. When designing a cloud-based solution, an architect is no longer faced with the question, “How do I get this job done with the technology we have?” but rather, “What is the right technology to support my use case?” In this blog post, we will walk through some initial scoping steps and walk through an example.

article thumbnail

Top 13 Skills To Become a Rockstar Data Scientist

KDnuggets

Education, coding, SQL, big data platforms, storytelling and more. These are the 13 skills you need to master to become a rockstar data scientist.

Education 123
article thumbnail

Kafka Listeners – Explained

Confluent

This question comes up on Stack Overflow and such places a lot , so here’s something to try and help. tl;dr: You need to set advertised.listeners (or KAFKA_ADVERTISED_LISTENERS if you’re using Docker images) to the external address (host/IP) so that clients can correctly connect to it. Otherwise, they’ll try to connect to the internal host address—and if that’s not reachable, then problems ensue.

Kafka 101
article thumbnail

Open Source: June Updates - New releases, continue to foster diversity and inclusion in tech

Zalando Engineering

Project Highlights Kopf - Kubernetes Operator Pythonic Framework now supports built-in resources and can be used to write controllers of any kind (pods, namespaces, mixed), not only of custom resources. Check out the latest release for more details [link] Skipper publishes new releases weekly. Some of the important features were implemented such as support to proxy Kubernetes API server and support Kubernetes externalName services from ingress.

AWS 52
article thumbnail

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

article thumbnail

Five Steps to a Successful Upgrade

Teradata

Learn the five key steps to minimize downtime and ensure a seamless analytics software upgrade.

71
article thumbnail

Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection

Data Engineering Podcast

Summary Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real-time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. In this interview he explains the system design that he tested, his findings for how these tools were able to work together, and how they behaved at diffe

Kafka 100
article thumbnail

What is Data Extraction and How It Can Serve Your Business

InData Labs

In the highly competitive business world of today, data reign supreme. Customer personal data, comprehensive operating statistics, sales figures, or inter-company information may play a core role in strategic decision making. It’s vital to keep an eye on the quantity and quality of data that can be captured and extracted from different web sources.

IT 52
article thumbnail

Understanding Tensor Processing Units

KDnuggets

The Tensor Processing Unit (TPU) is Google's custom tool to accelerate machine learning workloads using the TensorFlow framework. Learn more about what TPUs do and how they can work for you.

Process 123
article thumbnail

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.