Sat.Jun 02, 2018 - Fri.Jun 08, 2018

article thumbnail

ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34

Data Engineering Podcast

Summary Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports documents, dey/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steeman and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today.

article thumbnail

Programming Best Practices For Data Science

Dataquest

The data science life cycle is generally comprised of the following components: data retrieval data cleaning data exploration and visualization statistical or predictive modeling While these components are helpful for understanding the different phases, they don’t help us think about our programming workflow. Often, the entire data science life cycle ends up as an arbitrary mess of notebook cells in either a Jupyter Notebook or a single messy script.

article thumbnail

Recap of Hadoop News for May 2018

ProjectPro

News on Hadoop - May 2018 Data-Driven HR: How Big Data And Analytics Are Transforming Recruitment.Forbes.com, May 4, 2018. With platforms like LinkedIn and Glassdoor giving every employer access to valuable big data, the world of recruitment transforming to intelligent recruitment.HR teams that make use of big data in future are likely to be successful in recruiting the right talent in the coming years.

Hadoop 52
article thumbnail

Turning petabytes of pharmaceutical data into actionable insights

Cloudera

Authors: Mai N. Nguyen, Accenture & Mitch Gomulinski, Cloudera. Imagine storing the DNA of the entire population of the US – and then cloning them, twice. That’s the equivalent of 1 petabyte ( ComputerWeekly ) – the amount of unstructured data available within our large pharmaceutical client’s business. Then imagine the insights that are locked in that massive amount of data.

article thumbnail

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.