Trying to scrap my lifting Evernote page

This is my first coding project in a while. The main goal of this project to scrap the data from my Evernote and turn it into stats.

Evernote note OG.png

. As you can see on this image its mainly  with a emojis to track my progress. As the note has a regularly structure  it should not be too complex. So first we want to get a file to where we can use python to extract the information.

export options.png

As the other image shows you there are many options to export a note from Evernote. The option went with was the single HTML page. Evernote has their own custom XML format called enex. This file format allows users to import notes back into Evernote. But trying to decipher this new markup language I thought will lead to extra work on my hands. Especially if I’m going to use tools to are not custom designed for it.

Html_evernote_firefox.png




As this is a html file I’m dealing with I will using the standard web scraping tools mainly beautifulsoup. So decided to preview the html file. The results don’t look to bad. But a quick look at my text editor shows its not that simple.

Html_evernote_OG.png






As you can see the useful information of the note (Date and lifts) are clumped around one line. So scraping this will be not a lot more harder. This first thing I did was import beautifulSoup as this the library I will be using to extract the information from this file. Now to pass the file to the file to the BeautifulSoup parser function we will need to open with with python. Using this simple line file = open("Starting strength log.html", encoding="utf8").

Now we can pass this file to the parser like this soup = BeautifulSoup(file, "html.parser"). Now the beautifulsoup has the file parsed. We can now navigate and explore how we would scrape this file. print(soup.prettify()) will give us a good start as this prints the html as formatted Unicode sting with each tag on separate line making it easier to read.

Evernote prettity soup.png

From looking at the prettify results we can see that the useful bits of the note are within the body tag. And the content itself are within div tags down the hierarchy. Now we can use beautifulsoup to navigate this tree. I used the find() function to search within the body. soup.find("body")

Evernote note body tag.png

Nice now the prompt prints information that its on the note. But the content is still a few levels down. As we will be using this section of the html tree can save the soup find as a variable. bodyhtml = soup.find("body"). Now we can use this variable navigate the different levels of the body tag.

As we can see the content we want are in other tags, mainly under the span tag. So we just simply just want state the tags we want to get under the parse tree. bodyhtml.div.span should do the trick.

Evernote note span tag.png

Now we have zoomed in even more into the parse tree. We find that the other pieces of content are located on more lower levels. So, we need to explore the rest of the content on this level. Mainly this div level most of all the useful content a under a div tag of the same level. Scrolling down the prompt we can see the specific type of information we want.

Evernote note lifting section.png

Simply the date and the weights used on the lift. As the div tags are on the same level we want to navigate the parse tree sideways. We will attach the .next_sibling to the bodyhtml from earlier. So the command for the navigation would like this bodyhtml.div.span.div.next_sibling.

zoom in div tag.png

The prompt shows another div tag on the same level. But we want information further within note. Which are the dates and lifts. As they are on same level on the parse tree we could keep on attaching the .next_sibling commard but that is not practical due to the amount of siblings there are on the parse tree.

We can use the .next_siblings so we can iterate over the div tags in the parse tree. So we would make a loop to print the sibling. So it would look something like this

for sibling in bodyhtml.div.span.div.next_siblings:
    print(sibling)
siblings.png

It has successfully printed the other sibling in the parse tree. So now we found out a successful way to see the lifts and the numbers in the note. But its too clear as we need to get rid of the non-useful html code like “</br>” and the first section note before the lifts. And later use that turn data into a array and a file so it can be read for data analyse.

To be honest this where I’m stuck right now, so the next article should be describing how to solve this issue. Right now im able to turn into a list:

list of note.png

But unwanted items needs to be removed and elements must be turned into strings or numbers.


Tobi Olabode