Why learning to work with small datasets is useful for ML

Many times, you hear about a large tech company doing something awesome with ML. You think that's great. You think, how can I do the same? So you copy their open-source code and try it for yourself. Then you notice the result is good, but not as amazing as you first thought. Then you spot that you are training the model with fewer than 200 samples, while the tech company is using a million examples. So you conclude that's why the tech company's model performs well.

This brings me to the topic of this blog post: most people doing ML need to get good at getting good results with small datasets. Unless you work for a FANG company, you won't have millions of data points at your disposal. As we apply this technology to different industries, we must deal with applying models when not much data is available. The lack of data can have many causes. The industry may have no experience with deep learning, so collecting data is a new process. Or collecting data may be very expensive because of the extra tools or expertise involved. Either way, I think we need to get used to the fact that we are not Google.

We do not have data centres full of personal data. Most businesses have fewer than 50 employees and deal with anywhere from a few dozen to a few thousand customers, depending on the business. Non-profits may not even have the resources to collect much data at all. So simply getting insights from the data we already have is super useful, compared to wrestling with a new cutting-edge model with all the bells and whistles. Remember, your user only cares about the results, so you can use the simplest model that works. Heck, if a simple feed-forward neural network does the job, go ahead.
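To give a feel for what "simplest model that works" can mean, here is a minimal sketch of a small feed-forward network in Keras. The 20-feature input, the layer sizes, and the binary label are all made-up placeholders for illustration, not a recommendation.

```python
import tensorflow as tf

# A tiny feed-forward classifier: two hidden layers and a sigmoid output.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),              # e.g. 20 tabular features (placeholder)
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary outcome
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# model.fit(X_train, y_train, epochs=20, validation_split=0.2)
```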

We should not worry about having the resources of tech companies, but about what we can do with the resources we have right now. A lot of the gains in ML come simply from throwing ungodly amounts of data at the model and seeing what happens next. We need to get results with fewer than 200 samples. Luckily, people have developed techniques that can help. Image augmentation, for example, edits photos so the model sees the same object in different orientations, sizes, and directions, and so learns a general idea of the object rather than memorising one particular view. Soon we may also use GANs to produce more data from a small dataset, which would let us train a model on a larger, partly generated dataset.
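As a rough illustration of the augmentation idea, here is a sketch using Keras preprocessing layers. The flip, rotation, and zoom settings below are illustrative guesses, not tuned values.

```python
import tensorflow as tf

# Random flips, rotations, and zooms applied on the fly, so each of the
# ~200 original images is seen in many slightly different forms.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),  # rotate up to ~36 degrees either way
    tf.keras.layers.RandomZoom(0.2),
])

# Typically used as the first block of a model:
# model = tf.keras.Sequential([augment, backbone, classifier_head])
```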

Reading about GPT-3 is fun, and it will very likely lead to some very useful products (just check out Product Hunt). But we are not going to have the opportunity to train a 10-billion-parameter behemoth. Don't get me wrong, we don't need to train a large model from scratch to be useful; this is what fine-tuning is for. The people using GPT-3 are not training the model from scratch, but using prompts to prime the model for the problems it will be dealing with.
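GPT-3 itself sits behind an API, but the fine-tuning idea applies to any pretrained model. Here is a minimal transfer-learning sketch with a pretrained MobileNetV2; the input size and the five output classes are placeholders I picked for illustration.

```python
import tensorflow as tf

# Reuse a network pretrained on ImageNet and only train a small head on top.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pretrained weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # e.g. 5 product categories
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(train_ds, epochs=10, validation_data=val_ds)
```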

I think we need to deal with the bread-and-butter problems that people actually want an AI to solve. Like a simple image classifier, which may be useful for a small business that needs to sort out different weights in its store. Or a simple time series analysis to forecast next season's sales for the shopkeeper. Models from Facebook and Google with 100 layers will not be helpful, and will likely give you grey hairs just setting them up. Again, the whole goal is to solve your customer's problem, not to split the atom.
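For the sales-forecasting example, something as plain as exponential smoothing often does the job. A sketch with statsmodels, using invented monthly sales numbers:

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Twelve months of made-up sales figures for a small shop.
sales = pd.Series(
    [120, 132, 101, 134, 90, 230, 210, 182, 191, 240, 278, 300],
    index=pd.date_range("2020-01-01", periods=12, freq="MS"))

# Additive trend, no seasonal term: about as simple as a forecast gets.
fit = ExponentialSmoothing(sales, trend="add", seasonal=None).fit()
print(fit.forecast(3))  # next three months
```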

As a podcast I heard a while ago put it: deep learning is already powerful enough; we just need to get it into the hands of the right people. Hopefully, I can help with that process. To do that, we need to be pragmatists and deal with the constraints that most people face when applying ML to their projects. While I'm happy that the researchers at OpenAI and the FANG companies are taking deep learning to its limits, I want on-the-ground experience of taking that knowledge and improving the world (yes, it sounds very hippy). Most people will not have the resources to spend millions of dollars on a cloud provider to train a model. They may have a budget of a few hundred, or maybe a few thousand, but not a few million. At current cloud computing rates, a budget like that should be more than enough, especially when dealing with small models and small datasets.