How to check if your model has a data problem
Sometimes you run your model and the results are mediocre. The problem may lie with the model itself, but it may also lie with your data. If you suspect your model is underperforming because of its data, there are a few things you can check.
Do you have enough data?
Make sure you have enough data. How much counts as enough is your call, and it depends on the type of data you are dealing with. For example, with images, around 100 can be just enough before you add image augmentation. Tabular data may need a bit more.
Jason Brownlee mentions that:
The amount of data required for machine learning depends on many factors, such as:
The complexity of the problem, nominally the unknown underlying function that best relates your input variables to the output variable.
The complexity of the learning algorithm, nominally the algorithm used to inductively learn the unknown underlying mapping function from specific examples.
Do you have a balanced dataset?
Does your data consist mostly of one class? If so, you may want to change that. Data like that skews the results one way, and the model will struggle to learn about the other classes. If you have this issue, adding more data from the other classes can help.
You could try under-sampling, which means deleting data points from the majority class. On the flip side, you can try over-sampling, which in its simplest form means copying samples from the minority class.
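As a rough illustration of both ideas, here is a minimal pandas sketch. The toy DataFrame and its column names ("feature", "label") are made up for the example, and random sampling is only one of several ways to rebalance a dataset.

```python
import pandas as pd

# Hypothetical toy dataset: 8 samples of class "a", 2 of class "b".
df = pd.DataFrame({
    "feature": range(10),
    "label": ["a"] * 8 + ["b"] * 2,
})

# Inspect the class distribution first.
counts = df["label"].value_counts()
print(counts)

minority_size = int(counts.min())
majority_size = int(counts.max())

# Under-sampling: keep only `minority_size` random rows from every class.
undersampled = df.groupby("label").sample(n=minority_size, random_state=0)

# Over-sampling: keep every original row and add random copies of the
# smaller classes until each class has `majority_size` rows.
parts = []
for _, group in df.groupby("label"):
    missing = majority_size - len(group)
    if missing > 0:
        extra = group.sample(n=missing, replace=True, random_state=0)
        group = pd.concat([group, extra])
    parts.append(group)
oversampled = pd.concat(parts, ignore_index=True)
```

If you resample, it is usually safer to do it on the training split only, so that the copied rows do not end up in your test set.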
If your data has a few outliers, you may want to get rid of them. You can detect them using the Z-score or the interquartile range (IQR).
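Both outlier checks can be sketched in a few lines of pandas. The column name "value", the Z-score cut-off of 3, and the 1.5 × IQR fences are common conventions assumed here, not rules; adjust them to your data.

```python
import pandas as pd

# Hypothetical column: 19 ordinary readings and one obvious outlier (500).
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                             12, 10, 13, 11, 12, 11, 10, 12, 13, 500]})

# Z-score: how many standard deviations each value is from the mean.
z = (df["value"] - df["value"].mean()) / df["value"].std()
z_filtered = df[z.abs() < 3]

# IQR: keep values within 1.5 * IQR of the first and third quartiles.
q1 = df["value"].quantile(0.25)
q3 = df["value"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_filtered = df[df["value"].between(lower, upper)]
```

Note that on very small samples a single extreme value inflates the standard deviation so much that its own Z-score may stay below 3, which is one reason to try the IQR method as well.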
Is your data actually good?
I’m talking about rookie mistakes like blank rows and missing numbers, which can be fixed with a few pandas operations. Because they tend to be so small, they are easy to miss.
Assuming you are using pandas, you can get rid of N/A values with df.dropna().
Do you need all of the columns in your dataset? If not, drop the ones you don't need. For example, if you are analysing house prices, the name of the resident is not a useful factor for the analysis. Likewise, if you're analysing the weather of a certain area, a dataset that also covers 10 other areas contains rows that are of no interest to you.
To make life easier for yourself, if you are using pandas, make sure the index is correct. This prevents headaches later on.
Check the data types of your columns, because a column may contain values of different data types or be stored as the wrong type. For example, if your DATE column is a text data type, you may want to convert it to a pandas datetime type for later data manipulation.
Also, a couple of your values may have extra characters that force them into a different data type. For example, if one of your columns should be a float data type but one of the values looks like this [9.0??], that value will count as text, and the whole column is then typically read as text, giving you problems later on.
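A small sketch of how this might look, using a made-up DataFrame with "price" and "rooms" columns. Counting the missing values before dropping anything shows how much data you are about to lose.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values and one fully blank row.
df = pd.DataFrame({
    "price": [100.0, np.nan, 250.0, np.nan],
    "rooms": [3.0, 2.0, np.nan, np.nan],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop only the rows where every value is missing (blank rows).
no_blank_rows = df.dropna(how="all")

# Drop every row that has at least one missing value.
complete_rows = df.dropna()
```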
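For example, dropping an unneeded column and keeping only the rows for one area could look like the sketch below. The DataFrame and its column names ("resident_name", "area", "price") are invented for illustration.

```python
import pandas as pd

# Hypothetical house price data.
houses = pd.DataFrame({
    "resident_name": ["Ann", "Bob", "Cy"],
    "area": ["North", "South", "North"],
    "price": [200_000, 150_000, 220_000],
})

# Drop a column that should not influence the analysis.
houses = houses.drop(columns=["resident_name"])

# Keep only the rows for the area you are interested in.
north_only = houses[houses["area"] == "North"]
```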
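What "correct" means depends on your data. Two common fixes, sketched here on a made-up DataFrame with an "id" column, are renumbering the index after filtering and using a unique identifier column as the index.

```python
import pandas as pd

# Hypothetical data with a unique identifier column.
df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Filtering leaves gaps in the default index (here: 0, 2, 3).
filtered = df[df["value"] != 2.0]

# Renumber the rows from 0 again and discard the old index.
renumbered = filtered.reset_index(drop=True)

# Alternatively, use the identifier as the index.
# verify_integrity=True raises an error if the ids are not unique.
indexed = df.set_index("id", verify_integrity=True)
print(indexed.loc[103, "value"])
```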
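A minimal sketch of the date example, assuming the dates are stored as text in year-month-day form (the format string is an assumption about your data):

```python
import pandas as pd

# Hypothetical data where the DATE column was read in as text.
df = pd.DataFrame({
    "DATE": ["2021-01-05", "2021-02-10", "2021-03-15"],
    "value": [1, 2, 3],
})

# Inspect the data type of every column.
print(df.dtypes)

# Convert the text column to a pandas datetime type.
df["DATE"] = pd.to_datetime(df["DATE"], format="%Y-%m-%d")

# Date-based operations are now available, for example the month.
months = df["DATE"].dt.month
```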
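One way to find and repair such values is sketched below. The column name "reading" is made up, and the regular expression assumes that anything other than digits, a decimal point, or a minus sign is junk, which may not hold for your data.

```python
import pandas as pd

# Hypothetical column that should be float but was read in as text.
df = pd.DataFrame({"reading": ["7.5", "9.0??", "8.25"]})

# Find the values that cannot be converted to a number as they are.
as_number = pd.to_numeric(df["reading"], errors="coerce")
bad_values = df.loc[as_number.isna(), "reading"]
print(bad_values)

# Strip every character that is not a digit, a dot, or a minus sign,
# then convert. Anything still unparseable becomes NaN.
stripped = df["reading"].str.replace(r"[^0-9.\-]", "", regex=True)
df["reading_clean"] = pd.to_numeric(stripped, errors="coerce")
```

Looking at the bad values before cleaning them is worthwhile, because the extra characters sometimes carry meaning, such as a unit or a flag for an uncertain measurement.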
Features in your data
Your dataset may contain bad features. Feature engineering will be needed to improve it.
You can start with feature selection to pick out the most useful features.
Do you have useless features like name and ID? If so, remove them. That may help.
There are multiple techniques for feature selection, such as Univariate Selection and Recursive Feature Elimination. Principal Component Analysis is often listed alongside them, although strictly speaking it builds new components rather than selecting existing features.
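To give a feel for the univariate idea, the sketch below scores each feature on its own by its absolute correlation with the target and keeps the best two. This is a simple pandas stand-in rather than a full selection method, and the data and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical house data; "price" is the target.
df = pd.DataFrame({
    "size_m2":  [50, 60, 80, 100, 120, 150],
    "rooms":    [2, 2, 3, 4, 4, 5],
    "house_id": [4, 1, 6, 2, 5, 3],   # arbitrary identifier
    "price":    [150, 180, 240, 310, 350, 450],
})

features = df.drop(columns=["price"])

# Score each feature on its own: absolute correlation with the target.
scores = features.corrwith(df["price"]).abs().sort_values(ascending=False)
print(scores)

# Keep the two highest-scoring features.
top_features = scores.head(2).index.tolist()
```

Correlation only captures linear relationships with a numeric target, so treat a low score as a hint to investigate, not as proof that a feature is useless.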
Afterwards, you can try feature extraction. This is done by combining existing features into more useful ones. If you have domain knowledge, you can manually create your own features.
Do the feature scales make sense? For example, if one of your features is supposed to be in the 0 to 1 range, then a value of 100 is a mistake. Being an outlier, that value will skew the data one way.
Depending on your data, you can try one-hot encoding. This is a great way to turn categorical data into the numeric values that machine learning models expect. You do this by splitting the categorical column into one column per category and putting a binary value in each of those columns.
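As a small example of hand-made features, the sketch below combines existing columns into two new ones. The columns and the derived features ("m2_per_room", "age_at_sale") are hypothetical; which combinations are useful depends on your domain.

```python
import pandas as pd

# Hypothetical house data.
df = pd.DataFrame({
    "size_m2": [80, 50, 110],
    "rooms": [4, 2, 5],
    "year_built": [1990, 2005, 2010],
    "year_sold": [2020, 2020, 2021],
})

# Combine existing columns into features that may carry more signal.
df["m2_per_room"] = df["size_m2"] / df["rooms"]
df["age_at_sale"] = df["year_sold"] - df["year_built"]
```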
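A quick range check can flag such values. The column name "ratio" and the 0 to 1 bounds below are assumptions for the example.

```python
import pandas as pd

# Hypothetical feature that is supposed to lie between 0 and 1.
df = pd.DataFrame({"ratio": [0.12, 0.55, 100.0, 0.98, -0.3]})

# Summary statistics make an impossible minimum or maximum easy to spot.
print(df["ratio"].describe())

# Flag the rows that fall outside the expected range (bounds included).
in_range = df["ratio"].between(0, 1)
suspicious = df[~in_range]
clean = df[in_range]
```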
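In pandas, one way to do this is pd.get_dummies. The DataFrame and the "colour" column below are made up for the example.

```python
import pandas as pd

# Hypothetical data with one categorical column.
df = pd.DataFrame({
    "colour": ["red", "green", "blue", "green"],
    "size": [1, 2, 3, 4],
})

# Replace "colour" with one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["colour"], dtype=int)
print(encoded)
```

Each row ends up with a 1 in exactly one of the new colour columns. For columns with many distinct categories this creates many new columns, so it is worth checking the number of unique values first.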
Resources:
https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
https://machinelearningmastery.com/feature-extraction-on-tabular-data/
https://towardsdatascience.com/feature-extraction-techniques-d619b56e31be
https://machinelearningmastery.com/data-preparation-for-machine-learning-7-day-mini-course/