The Xceptional Blog

Pandas, Caviar, and Deep Learning

Written by Natalie | Aug 8, 2018 8:11:04 AM

Analytics is redefining the world. Data is the new oil. Artificial intelligence (AI) is everywhere.

It feels like we hear some variant of these phrases too often. But there is a reason why such statements have become so pervasive. Data has quickly become the most valuable and differentiating asset for many organizations as it opens up new revenue streams and unearths opportunities for improvement in almost every part of an organization.

Innovative companies are devoting an ever-increasing amount of resources to collect, store and analyze data – in order to squeeze every drop of value out of this asset. And the companies that haven’t been investing in advanced analytics technologies, like deep learning, risk exposing themselves to major disruptions, forcing them to quickly evolve or risk extinction.

As one would expect with any early-stage technology, as organizations prioritize analytics, the number of toolkits and use cases has proliferated. At times, it feels like there’s a new analytics tool or data framework every day. And right now, the area that is heating up the fastest is AI, specifically deep learning, which uses a more iterative and layered approach for things like image classification and natural language processing.

Data is at the heart of AI

From what I’ve seen, how an organization utilizes its DATA continues to be the differentiator that separates the leaders from the rest of the pack. To be more precise, it’s in the types of data, the depth of data, and the uniqueness of data available for analysis.

Unstructured data gives organizations the opportunity to significantly move the needle, as it exposes rich, unique data sets consisting of images, video, and streaming IoT data that have previously gone untapped in the data analytics space. The bad news is that the data requirements for deep learning (DL) models are quite different from those of the traditional data that organizations are used to dealing with. For example, most DL use cases are image- and video-heavy workloads, which are huge in data size and virtually incompressible. In my experience, I’ve only been able to get up to 4.6 percent compression using the most extreme lossy algorithm on ImageNet, which doesn’t allow much in the way of savings or efficiency when managing billions or trillions of files. Additionally, the advent of GPUs to handle highly iterative deep learning models from frameworks like TensorFlow has added concurrency as a new infrastructure requirement, as files are often read millions of times by a single layer of a convolutional neural network (CNN).
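For readers who want a feel for the "virtually incompressible" point, here is a rough sketch in Python. It is not the ImageNet experiment described above: it swaps in lossless zlib compression, and it uses random bytes as a hypothetical stand-in for already-encoded image data and a repeated text line as a stand-in for traditional structured data. The function name savings_percent and both sample payloads are illustrative assumptions.

```python
import os
import zlib


def savings_percent(payload: bytes, level: int = 9) -> float:
    """Return the percentage of bytes saved by zlib at the given level."""
    compressed = zlib.compress(payload, level)
    return 100.0 * (1 - len(compressed) / len(payload))


# Hypothetical stand-ins, not real workloads:
# random bytes behave like already-encoded image or video data,
# while a repeated text line behaves like classic structured or log data.
image_like = os.urandom(1_000_000)
text_like = b"timestamp,sensor_id,reading\n" * 40_000

print(f"image-like payload: {savings_percent(image_like):.2f}% saved")
print(f"text-like payload:  {savings_percent(text_like):.2f}% saved")
```

On data like the first payload, the savings should hover around zero, while the repetitive text shrinks dramatically. That gap is the practical reason capacity planning for image and video workloads cannot count on compression.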

These extreme DL requirements for performance, scale, and flexibility don’t neatly conform to traditional block storage boundaries. Luckily, game-changing innovations in unstructured storage and in computing (the rise of the GPU), paired with deep learning networks for image recognition or natural language processing, are finally enabling access to and analysis of data that was hard to glean much insight from a decade ago.

AI infrastructure: Pandas vs. Caviar

As anyone who has taken Andrew Ng’s Deep Learning AI course will tell you, Andrew argues that it’s less about the algorithms and math; he asserts it’s all about the data: data curation, data engineering, data labeling, and data management of the sprawling infrastructure needed to support the hundreds of terabytes (TB) to hundreds of petabytes (PB) of pictures, videos, and streaming sensor data. Thus, in the AI space, one of the biggest challenges is managing all the unstructured data with the right infrastructure stack, one that can both persist PBs of data and feed that data to the data-hungry AI models.

To quote Andrew Ng again, the infrastructure choices are all about pandas and caviar.

The “Raising Pandas” approach refers to the fact that many adult pandas babysit and care for one panda baby at a time, or in this case, focus on training one model at a time. For example, if you’re a smaller company or you’re just starting on your AI journey, you might only have the infrastructure and computational capacity to train, measure, and tune a single model at a time. This is a great solution if you only have one data scientist, if there aren’t many use cases for deep learning in your organization, or if the model requires only a modest amount of data (<100 TB) to train.

“Caviar” refers to laying thousands of eggs and training many models in parallel, then picking the model with the best learning curve. This approach is often required if petabytes of data are involved in the solution, as in the case of autonomous driving and fraud detection. These solutions require a large-scale and complex distributed computing environment to simultaneously train multiple deep learning models and find a best-fit model. In the more advanced cases, these can even be automated and managed with a container-based solution like BlueData to provide an elastic “as-a-service” approach.
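To make the contrast concrete, here is a minimal sketch in Python of the two strategies. Everything in it is an illustrative assumption: the toy one-weight model, the tiny data set, the function names panda and caviar, and the use of the lowest final loss as a simple proxy for "the best learning curve." A real caviar setup would spread full deep learning jobs across a distributed cluster rather than threads on one machine.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data for y = 3x, a stand-in for a real training set.
DATA = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]


def train(learning_rate, steps=50):
    """Fit a single weight by gradient descent and return its loss curve."""
    w = 0.0
    curve = []
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in DATA) / len(DATA)
        w -= learning_rate * grad
        loss = sum((w * x - y) ** 2 for x, y in DATA) / len(DATA)
        curve.append(loss)
    return curve


def panda(learning_rate):
    """Pandas: babysit one model, training a single configuration."""
    return learning_rate, train(learning_rate)


def caviar(learning_rates, workers=4):
    """Caviar: train many configurations in parallel, keep the best one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        curves = list(pool.map(train, learning_rates))
    # "Best learning curve" is approximated here by the lowest final loss.
    return min(zip(learning_rates, curves), key=lambda pair: pair[1][-1])


single_lr, single_curve = panda(0.01)
best_lr, best_curve = caviar([0.001, 0.005, 0.01, 0.05])
print(f"panda:  lr={single_lr}, final loss={single_curve[-1]:.6f}")
print(f"caviar: lr={best_lr}, final loss={best_curve[-1]:.6f}")
```

The panda path spends all of its compute on one guess, while the caviar path pays for every candidate up front and simply keeps the winner, which is why it demands so much more infrastructure.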

Summary

Regardless of your data types and the means of managing value from your data, remember that the companies with the most innovative and actionable DATA win. As your AI requirements mature from pandas to caviar, the infrastructure decisions you make today will have a major impact on your business tomorrow.

 

By Keith Manthey

Published with permission from https://blog.dellemc.com/en-us/