How to find similar words in your Excel table or database

I was reading a post about a person who had a problem mapping data from an Excel table to a database. You may find it tedious to transfer data from the “cats” field to the “cat” field.

 

While I'm not an expert in NLP at all, from googling around it looks like this can be done reasonably well.

First, you want to move the words you have into a separate text file.

If you have past data, put it into two separate files: one for the original data and one for the destination data.

For example:

original data: mDepthTo

destination data: Depth_To

Next comes pre-processing. You want to remove miscellaneous or non-alphanumeric characters and punctuation, so you can get rid of those underscores. To make life easier for yourself, convert the data to a uniform case. The NLTK library is good at this. A rough sketch of this step is below.
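
Here is a minimal sketch of that pre-processing in plain Python, assuming your column names sit one per line in files called original.txt and destination.txt (those file names are just placeholders for wherever you saved your data):

```python
import re

def clean(name):
    # Lowercase and strip underscores, punctuation and other odd characters,
    # so "mDepthTo" and "Depth_To" end up in a comparable form.
    return re.sub(r"[^a-z0-9]", "", name.lower())

# Placeholder file names -- point these at wherever you saved your columns.
with open("original.txt") as f:
    original = [clean(line) for line in f if line.strip()]
with open("destination.txt") as f:
    destination = [clean(line) for line in f if line.strip()]

print(original[:5])
print(destination[:5])
```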

 

After that, you want to encode those words into vectors. Try TF-IDF. You can do this with scikit-learn, so you don’t need to install any extra modules.

A brief explanation of TF-IDF:

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents. (Source: https://monkeylearn.com/blog/what-is-tf-idf/)
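
With scikit-learn, the encoding step could look something like the sketch below. The character n-gram setting is my own assumption: single column names like “mdepthto” contain no separate words to count, so short character chunks give TF-IDF something to compare.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny cleaned example -- in practice use the lists from the previous step.
original = ["mdepthto", "mdepthfrom", "msampleid"]
destination = ["depthto", "depthfrom", "sampleid"]

# analyzer="char_wb" compares 2-3 character chunks instead of whole words,
# which is an assumption on my part for short column names.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectorizer.fit(original + destination)  # shared vocabulary for both lists

original_vectors = vectorizer.transform(original)
destination_vectors = vectorizer.transform(destination)
print(original_vectors.shape)  # one row per original column name
```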

 

Now we want to work out the similarity between the vectors. You can use cosine similarity, as that’s the most common technique. Again, scikit-learn has this, so you can try it out easily.

 

From Machine Learning Plus:

Cosine similarity is a metric used to measure how similar the documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space.

 

Now we are making progress on word similarity, as you can compare the words in your two text files. See the sketch below.
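
Continuing from the vectors above, scikit-learn’s cosine_similarity gives a score for every original/destination pair, and taking the highest score in each row is one simple way to pick a match:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Rows = original names, columns = destination names.
similarity = cosine_similarity(original_vectors, destination_vectors)

# For each original column, pick the destination column it is closest to.
for i, name in enumerate(original):
    best = similarity[i].argmax()
    print(f"{name} -> {destination[best]} (score {similarity[i][best]:.2f})")
```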

For testing, you may want to hold back some examples so you can use them to evaluate the NLP model. Maybe you can create a custom metric for yourself that captures how closely the model was able to match the destination data. A toy version is sketched below.
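
Continuing with the same fitted vectorizer, a toy version of that metric could be the share of held-out examples where the top match equals the known destination name (known_pairs here is a hypothetical hold-out list):

```python
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical hold-out pairs saved for testing: (original, expected destination).
known_pairs = [("mdepthto", "depthto"), ("mdepthfrom", "depthfrom")]

correct = 0
for original_name, expected in known_pairs:
    vec = vectorizer.transform([original_name])  # reuse the fitted vectorizer
    scores = cosine_similarity(vec, destination_vectors)[0]
    if destination[scores.argmax()] == expected:
        correct += 1

print(f"Matched {correct} of {len(known_pairs)} held-out columns correctly")
```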

 

Most of the ideas came from this Medium article; I just tried to adapt them to your problem.

You should check it out. They know what they are talking about when it comes to NLP.

 

Summary:

1. Save data into separate text files

2. Pre-process the data (punctuation, odd characters, etc.)

3. Encode data with TF-IDF

4. Get word similarity with cosine similarity

5. Create a metric to check whether the model maps the data correctly