March, 2021

article thumbnail

Building a Data Engineering Project in 20 Minutes

Simon Späti

This post focuses on practical data pipelines with examples from web-scraping real-estates, uploading them to S3 with MinIO, Spark and Delta Lake, adding some Data Science magic with Jupyter Notebooks, ingesting into Data Warehouse Apache Druid, visualising dashboards with Superset and managing everything with Dagster. The goal is to touch on the common data engineering challenges and using promising new technologies, tools or frameworks, which most of them I wrote about in Business Intelligence

article thumbnail

Toward a Data Mesh (part 2) : Architecture & Technologies

François Nguyen

Just an illustration – not the truth and you certainly can do it with other technologies. TL;DR After setting up and organizing the teams, we are describing 4 topics to make data mesh a reality. the selfserve platform based on a serverless philisophy (life is too short to do provisioning) the building of data products (as code) : we are building data workflows not data pipelines the promotion of data domains where the metadata on the data life cycle is as important as your data The old dat

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

article thumbnail

Apache Kafka Made Simple: A First Glimpse of a Kafka Without ZooKeeper

Confluent

At the heart of Apache Kafka® sits the log—a simple data structure that uses sequential operations that work symbiotically with the underlying hardware. Efficient disk buffering and CPU cache usage, […].

Kafka 145
article thumbnail

Data Quality Management For The Whole Team With Soda Data

Data Engineering Podcast

Summary Data quality is on the top of everyone’s mind recently, but getting it right is as challenging as ever. One of the contributing factors is the number of people who are involved in the process and the potential impact on the business if something goes wrong. In this episode Maarten Masschelein and Tom Baeyens share the work they are doing at Soda to bring everyone on board to make your data clean and reliable.

article thumbnail

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

article thumbnail

How to trigger a spark job from AWS Lambda

Start Data Engineering

Event driven pipelines Lambda function to trigger spark jobs Setup and run Monitoring and logging Teardown Conclusion Further reading References Event driven pipelines Event driven systems represent a software design pattern where a logic is executed in response to an event. This event can be a file creation on S3, a new database row, API call, etc.

AWS 100
article thumbnail

Remote Workstations for the Discerning Artists

Netflix Tech

By Michelle Brenner Netflix is poised to become the world’s most prolific producer of visual effects and original animated content. To meet that demand, we need to attract the world’s best artistic talent. Artists like to work at places where they can create groundbreaking entertainment instead of worrying about getting access to the software or source files they need.

More Trending

article thumbnail

Towards a Data Mesh (part 1) : Data Domains and Teams Topologies.

François Nguyen

Just an illustration – not the truth and we will pivot if it does not work. I discovered Zhamak Dehghani’s first article about Data Mesh in August 2020. Thanks to Youtube, you have the live illustration in this video with even more context and explanations. And then, you have this second video that is an introduction to her second article (december 2020).

article thumbnail

Under the Hood of Real-Time Analytics with Apache Kafka and Pinot

Confluent

Real-time analytics has become the need of the hour for modern internet companies. The ability to derive internal insights around business metrics, user growth and adoption as well as security […].

Kafka 144
article thumbnail

Real World Change Data Capture At Datacoral

Data Engineering Podcast

Summary The world of business is becoming increasingly dependent on information that is accurate up to the minute. For analytical systems, the only way to provide this reliably is by implementing change data capture (CDC). Unfortunately, this is a non-trivial undertaking, particularly for teams that don’t have extensive experience working with streaming data and complex distributed systems.

article thumbnail

CFO Analytics: What Is It and Why Should You Care?

Teradata

Finance-driven analytics might be the largest untapped opportunity for organizations & a catalyst for driving business value & strategic vision. But, what exactly is CFO analytics?

IT 119
article thumbnail

Prepare Now: 2025s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, and Terrence Sheflin

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

article thumbnail

Top Three Requirements for Data Flows

Cloudera

Data flows are an integral part of every modern enterprise. No matter whether they move data from one operational system to another to power a business process or fuel central data warehouses with the latest data for near-real-time reporting, life without them would be full of manual, tedious and error-prone data modification and copying tasks. At Cloudera, we’re helping our customers implement data flows on-premises and in the public cloud using Apache NiFi , a core component of Cloudera DataFl

Cloud 119
article thumbnail

ConsoleMe: A Central Control Plane for AWS Permissions and Access

Netflix Tech

ConsoleMe: A Central Control Plane for AWS Permissions and Access By Curtis Castrapel , Patrick Sanders , and Hee Won Kim At AWS re:Invent 2020, we open sourced two new tools for managing multi-account AWS permissions and access. We’re very excited to bring you ConsoleMe (pronounced: kuhn-soul-mee ), and its CLI utility, Weep (pun intended)! If you missed the talk, check it out here.

AWS 102
article thumbnail

Reverse ETL with dbt and Grouparoo

Grouparoo

Teams are centralizing their data in their data warehouse by loading data in and transforming it as necessary. Increasingly, we are seeing teams turn to dbt to do this transforming. The idea is to write *.sql files that, when run in the right order, create useful rollup tables or materialized views of the data. We've been asked by teams using dbt how Grouparoo can then sync their data to their cloud-based apps.

article thumbnail

Monitoring Your Event Streams: Integrating Confluent with Prometheus and Grafana

Confluent

Self-managing a highly scalable distributed system with Apache Kafka® at its core is not an easy feat. That’s why operators prefer tooling such as Confluent Control Center for administering and […].

Kafka 131
article thumbnail

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

article thumbnail

Managing The DoorDash Data Platform

Data Engineering Podcast

Summary The team at DoorDash has a complex set of optimization challenges to deal with using data that they collect from a multi-sided marketplace. In order to handle the volume and variety of information that they use to run and improve the business the data team has to build a platform that analysts and data scientists can use in a self-service manner.

article thumbnail

How to Host a Virtual Global Data Science Hackathon

Teradata

Learn how best to host a virtual hackathon, or any virtual event, with these tips and tricks from our Teradata team. Read more.

article thumbnail

Congratulations to our 2021 Partner Award Winners

Cloudera

We announced at our Partner Sales Kickoff, the winners of the 2021 Cloudera Partner Awards. These six awards recognize Cloudera partners who are dedicated to enabling customers to do more with their data by leveraging the power of an enterprise data cloud. Thank you to this year’s winners for their partnership in helping our joint customers’ ability to drive value from their data in the hybrid cloud.

article thumbnail

The Netflix Cosmos Platform

Netflix Tech

Orchestrated Functions as a Microservice by Frank San Miguel on behalf of the Cosmos team Introduction Cosmos is a computing platform that combines the best aspects of microservices with asynchronous workflows and serverless functions. Its sweet spot is applications that involve resource-intensive algorithms coordinated via complex, hierarchical workflows that last anywhere from minutes to years.

Media 93
article thumbnail

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.

article thumbnail

Software Security at Rocketship Pace

Afterpay Tech

Photo by Yancy Min on Unsplash By: Alex Rosenzweig Overview Helping our engineers build amazing products that are worthy of our customer’s trust is the job of our product security team. One of the tools in the product security team’s arsenal is our code scanning platform which we lovingly call “Intersect” Effective code scanning is a core building block of a modern security program.

Coding 52
article thumbnail

How to Tune RocksDB for Your Kafka Streams Application

Confluent

Apache Kafka ships with Kafka Streams, a powerful yet lightweight client library for Java and Scala to implement highly scalable and elastic applications and microservices that process and analyze data […].

Kafka 131
article thumbnail

Leave Your Data Where It Is And Automate Feature Extraction With Molecula

Data Engineering Podcast

Summary A majority of the time spent in data engineering is copying data between systems to make the information available for different purposes. This introduces challenges such as keeping information synchronized, managing schema evolution, building transformations to match the expectations of the destination systems. H.O. Maycotte was faced with these same challenges but at a massive scale, leading him to question if there is a better way.

IT 100
article thumbnail

Enterprise Data Operating Systems in the Cloud: Necessary, But Not Sufficient

Teradata

Getting your Cloud data architecture right starts with understanding which data products you need, the roles they perform, & the functional & non-functional characteristics that those roles demand.

Cloud 110
article thumbnail

The Ultimate Guide To Data-Driven Construction: Optimize Projects, Reduce Risks, & Boost Innovation

Speaker: Donna Laquidara-Carr, PhD, LEED AP, Industry Insights Research Director at Dodge Construction Network

In today’s construction market, owners, construction managers, and contractors must navigate increasing challenges, from cost management to project delays. Fortunately, digital tools now offer valuable insights to help mitigate these risks. However, the sheer volume of tools and the complexity of leveraging their data effectively can be daunting. That’s where data-driven construction comes in.

article thumbnail

CDP Endpoint Gateway provides Secure Access to CDP Public Cloud Services running in private networks

Cloudera

Cloudera Data Platform (CDP) Public Cloud allows users to deploy analytic workloads into their cloud accounts. These workloads cover the entire data lifecycle and are managed from a central multi-cloud Cloudera Control Plane. CDP provides the flexibility to deploy these resources into public or private subnets. Nearly unanimously, we’ve seen customers deploy their workloads to private subnets.

article thumbnail

A Day in the Life of an Experimentation and Causal Inference Scientist @ Netflix

Netflix Tech

Stephanie Lane , Wenjing Zheng , Mihir Tendulkar Source credit: Netflix Within the rapid expansion of data-related roles in the last decade, the title Data Scientist has emerged as an umbrella term for myriad skills and areas of business focus. What does this title mean within a given company, or even within a given industry? It can be hard to know from the outside.

article thumbnail

Community, Metadata Management, and More: Top 10 Links From Across the Web

Data Council

Here's our March 2021 roundup of links from across the web that we selected for you: 1. How to Build a Community (Fishtown Analytics) Claire Carroll's first personal blog post on community-building is a must-read. As Fishtown Analytics' community manager for the last 2.5 years, she's arguably behind the success of the dbt community and its best-in-class practices, so we expected good advice… but she really hit the ball out of the park with this one!

article thumbnail

To Pull or to Push Your Data with Kafka Connect? That Is the Question.

Confluent

Today, every company is a data company. There are many different data pipeline, integration, and ingestion tools in the market, but before you can feed your data analytics needs, data […].

Kafka 126
article thumbnail

Driving Responsible Innovation: How to Navigate AI Governance & Data Privacy

Speaker: Aindra Misra, Senior Manager, Product Management (Data, ML, and Cloud Infrastructure) at BILL

Join us for an insightful webinar that explores the critical intersection of data privacy and AI governance. In today’s rapidly evolving tech landscape, building robust governance frameworks is essential to fostering innovation while staying compliant with regulations. Our expert speaker, Aindra Misra, will guide you through best practices for ensuring data protection while leveraging AI capabilities.

article thumbnail

Promisifying Your Node Callback Functions

Grouparoo

The Grouparoo application is written in JavaScript (Node). It uses the modern promise-based pattern ( async / await ) for reading and writing data asynchronously. And we do this a lot — we are a data sync tool! Every once in awhile we'll come across a JavaScript library that is written around the old callback-based pattern, where the error object is the first parameter in the callback function, followed by the result.

article thumbnail

Enhancing Customer Experience with Every Journey

Teradata

Big Tech giants dominate by using data to improve product & experience. The auto industry can emulate this by analyzing data to improve customer experience & guide individual choices.

Data 95
article thumbnail

Data governance beyond SDX: Adding third party assets to Apache Atlas

Cloudera

Governance and the sustainable handling of data is a critical success factor in virtually all organizations. While Cloudera Data Platform (CDP) already supports the entire data lifecycle from ‘Edge to AI’, we at Cloudera are fully aware that enterprises have more systems outside of CDP. It is crucial to avoid that CDP becomes the next silo in your IT landscape.

article thumbnail

Production Media Management: Transforming Media Workflows by leveraging the Cloud

Netflix Tech

Written by Anton Margoline , Avinash Dathathri , Devang Shah and Murthy Parthasarathi. Credit to Netflix Studio’s Product, Design, Content Hub Engineering teams along with all of the supporting partner and platform teams. In this post, we will share a behind-the-scenes look at how Netflix delivers technology and infrastructure to help production crews create and exchange media during production and post production stages.

Media 71
article thumbnail

What Is Entity Resolution? How It Works & Why It Matters

Entity Resolution Sometimes referred to as data matching or fuzzy matching, entity resolution, is critical for data quality, analytics, graph visualization and AI. Learn what entity resolution is, why it matters, how it works and its benefits. Advanced entity resolution using AI is crucial because it efficiently and easily solves many of today’s data quality and analytics problems.