How to extract currency-related info from text
I was scrolling through Reddit and a user asked how to extract currency-related text in news headlines.
This is the question:
Hi, I'm new to this group. I'm trying to extract currency related entities in news headlines. I also want to extend it to a web app to highlight the captured entities. For example the sentence "Company XYZ gained $100 million in revenue in Q2". I want to highlight [$100 million] in the headline. Which library can be used to achieve such outcomes? Also note since this is news headlines $ maybe replaced with USD, in that case I would like to highlight [USD 100 million].
While I have not done this before, I do have experience scraping text from websites, and the problem looked simple enough that it would likely only require basic NLP.
So I did a few Google searches and found several popular libraries that do just that.
Using spaCy to extract monetary information from text
In this blog post, I’m going to show you how to extract currency info from text.
I’m going to use this headline I found on Google:
23andMe Goes Public as $3.5 Billion Company With Branson Aid
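Now, using a few lines of code with the NLP library spaCy, we can extract the currency-related text. You will need spaCy and its small English model installed first (pip install spacy, then python -m spacy download en_core_web_sm).
The code was adapted from this Stack Overflow answer: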
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('23andMe Goes Public as $3.5 Billion Company With Branson Aid')
extracted_text = [ent.text for ent in doc.ents if ent.label_ == 'MONEY']
print(extracted_text)
['$3.5 Billion']
With only a few lines of code, we were able to extract the financial information.
You will need some extra code when dealing with multiple headlines, such as storing them in a list and using a for loop to run the extraction on each one.
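The loop over several headlines could look something like the sketch below. The function name `extract_money` is my own invention, and it assumes you pass in the already loaded `nlp` pipeline. It uses `nlp.pipe`, spaCy's method for processing a batch of texts, instead of calling `nlp()` once per headline.

```python
def extract_money(nlp, headlines):
    """Return one list of MONEY entity strings per headline.

    `nlp` is assumed to be a loaded spaCy pipeline, for example
    spacy.load("en_core_web_sm"). The function name is hypothetical.
    """
    results = []
    # nlp.pipe yields one Doc per input text, in the same order as the input.
    for doc in nlp.pipe(headlines):
        results.append([ent.text for ent in doc.ents if ent.label_ == "MONEY"])
    return results


# Usage, assuming spaCy and the en_core_web_sm model are installed:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   headlines = [
#       "23andMe Goes Public as $3.5 Billion Company With Branson Aid",
#       "Company XYZ gained $100 million in revenue in Q2",
#   ]
#   for headline, amounts in zip(headlines, extract_money(nlp, headlines)):
#       print(headline, "->", amounts)
```

Headlines with no monetary amount simply get an empty list, so the results stay lined up with the input.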
spaCy is a great library for getting things done with NLP. I don’t consider myself an expert in NLP, but you should check it out.
The code is taking advantage of spaCy’s named entities.
From the docs:
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction.
Each named entity carries annotations, such as its label, which we access in the code. By keeping only the entities labeled MONEY, we make sure that we extract just the financial information in the headline.
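Besides the text and the label, each entity also records where it sits in the original string through its `start_char` and `end_char` attributes. A small sketch of reading those annotations is below. The helper name `money_spans` is hypothetical, and it assumes `doc` is a spaCy `Doc`.

```python
def money_spans(doc):
    """Return (start_char, end_char, text) for each MONEY entity in a spaCy Doc.

    The helper name is hypothetical; label_, start_char, end_char and text
    are attributes of the entity spans in doc.ents.
    """
    return [
        (ent.start_char, ent.end_char, ent.text)
        for ent in doc.ents
        if ent.label_ == "MONEY"
    ]


# Usage, assuming spaCy and the en_core_web_sm model are installed:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   doc = nlp("23andMe Goes Public as $3.5 Billion Company With Branson Aid")
#   print(money_spans(doc))
```

The character offsets are handy later if you want to mark the exact position of the amount inside the headline.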
How to replace the currency symbol with the currency abbreviation
As we can see, spaCy did a great job extracting the information we wanted, so the main task is done.
In the question, the person also needed help with replacing the dollar sign with USD and with highlighting the financial information.
Replacing the dollar sign is easy, as it can be done with Python’s built-in string methods.
print(extracted_text[0].replace('$', 'USD '))
USD 3.5 Billion
Now we have replaced the symbol with the dollar abbreviation. The same approach works for any other currency you want.
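To handle more than one currency, a small lookup table could do the job. This is only a sketch: the table `CURRENCY_CODES` and the function `symbol_to_code` are names I made up, and treating "$" as USD is an assumption, since other countries use the dollar sign too.

```python
# Illustrative symbol-to-code table; extend it with the currencies you need.
# Mapping "$" to USD is an assumption, other dollar currencies use the same sign.
CURRENCY_CODES = {"$": "USD", "€": "EUR", "£": "GBP"}


def symbol_to_code(amount):
    """Replace a leading currency symbol with its three-letter code."""
    for symbol, code in CURRENCY_CODES.items():
        if amount.startswith(symbol):
            # Drop the symbol and any space after it, then prefix the code.
            return f"{code} {amount[len(symbol):].lstrip()}"
    return amount  # no known symbol found, leave the text unchanged


print(symbol_to_code("$3.5 Billion"))  # USD 3.5 Billion
print(symbol_to_code("€200 million"))  # EUR 200 million
```

Text that does not start with a known symbol is returned untouched, so amounts already written as "USD 100 million" pass through as they are.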
Highlighting selected text in data
Highlighting the text moves away from data processing and into the realm of web development.
It would require adjusting the person’s web app to add some extra HTML and CSS.
While I don’t have the know-how to do that, I can point you in some directions:
Highlight Searched text on a page with just Javascript
https://stackoverflow.com/questions/8644428/how-to-highlight-text-using-javascript
Hopefully, this blog post has helped with your situation and set you on your way to completing your project.
If you want more stuff like this, then check out my mailing list, where I solve many of your problems straight from your inbox.