Python Based Synonym Harvester

Published by Thom Ives on

Also Affectionately Known As Synonym Roll

Get it on GitHub

Overview

I felt I should interrupt my current flow of posts with ones covering work I've had to do in my current professional role that I felt could help many others. This post covers the first of many such tools from my latest work, which is related to

Aligning phrases found in technical text documents and
associating those phrases to standardized phrase descriptions.

A while back, I could not find a python based synonym engine that would meet my needs. The ones that did exist were broken. Maybe some of them work now, but the python community has been a great help to my efforts over the years, and this is a chance for me to contribute to our community with an approach that gives immediate benefits and can be leveraged for other related tasks in the future. I also hope that this can help others in AI development and automation efforts.

As always, I expect that those that can make use of this will clone it from GitHub and refactor it to make it their own and make it serve their specific needs.

Synonym Needs

Other posts related to my current professional development work at the time of this post use global word vectors (GloVe). I am eager to share that work soon too in the hopes that my learning can help others. As I studied how GloVe was helping my work and also how it was not meeting all my needs, it occurred to me that I would have to roll up my sleeves and develop my own python based synonym engine. No problem. Such efforts just feed my favorite healthy addiction to making my own tools. Even if I don't use them long term, the insights are helpful beyond measure to other work, as I've indicated in all my posts. I'd like to point out the myriad of reasons that a python based synonym engine would be valuable, but that would take a large amount of text, AND, if you've come here, you've already had at least one thought about why it would be useful.

I'm a big fan of Cassie Kozyrkov's philosophical writings on AI & ML work in general. Where I can struggle to articulate why I make the directional moves in my work that I do, her explanations of the limits and values and needs in our ML & AI work are extremely well worded … and helpful for when I get lost! Check out THIS article of hers. You don't need an AI for things you can just look up! Word vectors are powerful, but NLP still has a long way to go, so if NLP tools can be improved by making a hybrid tool that combines basic lookups AND math machines such as GloVe, I will try that.

The Core Code Functions

In the spirit of what is shared above, synonyms exist. We just want to harvest them as needed and do so as fast as possible. In my case, I even create a JSON based memory module to store only the synonyms that I need for a portion of my expert system AI (ESAI). I am a fan of the python wrapper tools around Selenium (PySelenium), because they help you get around many hurdles in dynamic websites, but they are also slow compared to such modules as requests and BeautifulSoup. I was relieved to find that I could accomplish harvesting synonyms with basic python modules, and that I would not need PySelenium.

Our next stop is an inspection of Thesaurus.com. Go there, and let's find synonyms for the word barrier. Notice that the URL field now reads "https://www.thesaurus.com/browse/barrier?s=t". Hmmm. Before we code, let's experiment a bit. What if I only submit

https://www.thesaurus.com/browse/barrier

to the URL field? Still works (please be suspicious of me planning ahead in a constructively lazy way, even if it may not be true).

Let’s inspect this page, but instead of using RIGHT-CLICK inspect in the browser window, let’s start some foundational coding to help us inspect the code for this page. The code below is in a file named Specific_Syns_1.py that’s stored in the GitHub repo for this post.

import requests
from bs4 import BeautifulSoup
import sys

URL_Front = "https://www.thesaurus.com/browse/"
word = "barrier"

URL = f'{URL_Front}{word}'

page = requests.get(URL)
if page.status_code != 200:
    # bail out cleanly so soup is never referenced while undefined
    sys.exit(f'Request for {URL} failed with status {page.status_code}')
soup = BeautifulSoup(page.content, 'html.parser')

print(soup.prettify())

Python Class Development

For the sake of the length of this post, I recommend that you run this code yourself. You might prefer to write the output to a file instead. If so, replace the last line of code, print(soup.prettify()), with the following code.

html_file_text = soup.prettify()

with open('barrier_thesaurus_dot_com.html', 'w', encoding='utf-8') as f:
    f.write(html_file_text)
 

Now, you can inspect the page contents using the power of your favorite IDE / code editor. I’ve saved the code for the barrier page from Thesaurus.com in the file named barrier_thesaurus_dot_com.html, which is in the GitHub repo associated with this post.

At this point, I started to look for clever ways to harvest all the synonyms as efficiently as possible. … FAST FORWARD: After recovering from a whiny fit over this work not going as smoothly as I thought it would, I got back to it with a spirit of belief that there was a slicker way to reach my goals. I inspected the code for the "barrier" word page a bit more and did some searches for synonyms that my code was not catching from the page, and VOILA! One of the <script> tags near the bottom of each Thesaurus.com synonym page contains a complete JSON data object that holds all the synonyms for that word and their strength values. HAPPINESS. I only needed to apply a few cleanup lines to the JSON object in that script block to load that JSON text into a workable python dictionary. The final version of that prototype code is below, and it is in the file Specific_Syns_2.py in the GitHub repo associated with this post. Note that the code below saves the cleaned JSON object to a JSON file named script_to_data.json. I used this JSON object to study how to harvest the data that I wanted. The python script then converts the loaded object into a lean dictionary for my specific needs at the time, to be used later.

import requests
from bs4 import BeautifulSoup
import json
import re

word = "barrier"

URL = f'https://www.thesaurus.com/browse/{word}'

page = requests.get(URL)
if page.status_code != 200:
    # bail out cleanly so soup is never referenced while undefined
    raise SystemExit(f'Request for {URL} failed with status {page.status_code}')
soup = BeautifulSoup(page.content, 'html.parser')

the_script = soup.find('script', text=re.compile("window.INITIAL_STATE"))
the_script = the_script.text
the_script = the_script.replace("window.INITIAL_STATE = ", "")
the_script = the_script.replace(':undefined', ':"undefined"')
# strip only the trailing semicolon so semicolons inside the data survive
the_script = the_script.rstrip().rstrip(';')
data = json.loads(the_script)

with open('script_to_data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

synonyms = {}
synonyms[word] = {}
for each_tab in data["searchData"]["tunaApiData"]["posTabs"]:
    for syn in each_tab["synonyms"]:
        sim = float(syn["similarity"])
        if sim not in synonyms[word]:
            synonyms[word][sim] = []
        synonyms[word][sim].append(syn["term"])

print(synonyms)
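
Once the dictionary above is built, it is easy to query. Here is a minimal sketch of how you might pull the strongest matches; the sample data and the top_synonyms helper are my own illustrations, not part of the repo or real Thesaurus.com output.

```python
# Illustrative sample in the shape produced by Specific_Syns_2.py:
# word -> similarity value -> list of synonym terms
synonyms = {
    "barrier": {
        100.0: ["obstacle", "barricade"],
        50.0: ["fence", "wall"],
        10.0: ["limit"],
    }
}

def top_synonyms(syns, word, min_sim=50.0):
    """Return synonyms for word with similarity >= min_sim, strongest first."""
    ranked = sorted(syns.get(word, {}).items(), reverse=True)
    return [term for sim, terms in ranked if sim >= min_sim for term in terms]

print(top_synonyms(synonyms, "barrier"))  # ['obstacle', 'barricade', 'fence', 'wall']
```

This is one reason to keep the similarity values as floats rather than strings: they sort numerically for free.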
 

Experiments And Class Usage

The next step was to turn the above code into a class that I could store amongst my tools, one that would work well in my ESAI code. The code for this class is stored in the file named Syns_Module_Builder.py, which is in the GitHub repo associated with this post. I describe each method of the class after the entire class presented below. Note that when I use __method_name__ style method names, I mean that the method is for internal, private class usage only. It is not intended to be called from other code that instantiates the class.

import os
import requests
from bs4 import BeautifulSoup
import json
import re


class Syns_Module_Builder:
    URL_BASE = 'https://www.thesaurus.com/browse/'

    def __init__(self, base_dir, syn_mod_name, word_list=[]):
        self.base_dir = base_dir
        self.syn_mod_name = syn_mod_name
        self.__build_file_name__()
        self.word_list = word_list
        self.syns_dict = {}

        self.__obtain_syns__()

    def __build_file_name__(self):
        if self.base_dir[-1] != '/':
            self.base_dir += '/'
        self.syn_mod_file_name = f'{self.base_dir}{self.syn_mod_name}.syn'

    def __obtain_syns__(self):
        module_exists = os.path.exists(self.syn_mod_file_name)

        if module_exists:
            with open(self.syn_mod_file_name, 'r', encoding='utf-8') as f:
                self.syns_dict = json.load(f)
            print('USING EXISTING MODULE')
            return self.syns_dict
        else:
            print('CREATING NEW MODULE')
            self.__capture_syns__()
            with open(self.syn_mod_file_name, 'w', encoding='utf-8') as f:
                json.dump(self.syns_dict, f, ensure_ascii=False, indent=4)
            print('NEW MODULE CREATION COMPLETE')
            return self.syns_dict

    def __capture_syns__(self):
        for word in self.word_list:
            URL = f'{Syns_Module_Builder.URL_BASE}{word}'
            page = requests.get(URL)
            if page.status_code != 200:
                continue
            else:
                soup = BeautifulSoup(page.content, 'html.parser')
                the_script = soup.find(
                    'script',
                    text=re.compile("window.INITIAL_STATE"))
                the_script = the_script.text
                the_script = the_script.replace("window.INITIAL_STATE = ", "")
                the_script = the_script.replace(':undefined', ':"undefined"')
                # strip only the trailing semicolon so semicolons in the data survive
                the_script = the_script.rstrip().rstrip(';')
                data = json.loads(the_script)

                self.syns_dict[word] = {}
                for a_tab in data["searchData"]["tunaApiData"]["posTabs"]:
                    for syns in a_tab["synonyms"]:
                        sim = float(syns["similarity"])
                        syn = syns["term"]
                        self.syns_dict[word][syn] = sim
 

Class Attribute And __init__ Method

Here’s the top most portion of the class again following the necessary imports …

class Syns_Module_Builder:
    URL_BASE = 'https://www.thesaurus.com/browse/'

    def __init__(self, base_dir, syn_mod_name, word_list=[]):
        self.base_dir = base_dir
        self.syn_mod_name = syn_mod_name
        self.__build_file_name__()
        self.word_list = word_list
        self.syns_dict = {}

        self.__obtain_syns__()
 

Note the URL_BASE class attribute that's available to ALL instantiations of this class and is not meant to change. Since this class is designed specifically to harvest synonyms from Thesaurus.com, this front part of the URL for the word synonyms that we seek will never change UNTIL Thesaurus.com changes it. If they do, we will hope to update the repo accordingly. I hesitate to make this an official pip installable python module because of its dependency on Thesaurus.com.

The __init__ method takes parameters:

  • base_dir – the directory where the synonym data object will be stored,
  • syn_mod_name – the synonym data object name (name.syn), and
  • word_list – the words that we want to associate with synonyms.

In __init__, we assign our input parameters to attributes of the instantiated class, and take three additional actions:

  • call self.__build_file_name__(), which will build the file name for the stored data (covered below),
  • create an empty dictionary – self.syns_dict = {}, and
  • call self.__obtain_syns__() (covered below)

The last method call is noteworthy. When we instantiate the class, all necessary actions are taken at that point. There is no intention at this point, due to the way I wanted to use the class, to have methods that are called after class instantiation. This will become clearer as we review the other methods below.
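
This instantiate-and-done pattern can be sketched in miniature. The toy Greeter class below is my own illustration, not code from the repo: all work happens in __init__, and callers only read the resulting attribute afterward.

```python
class Greeter:
    """Toy illustration of the pattern used by Syns_Module_Builder:
    the constructor does all the work; callers just read the results."""

    def __init__(self, names):
        self.greetings = {}
        self.__build_greetings__(names)  # all work happens at instantiation

    def __build_greetings__(self, names):
        # internal "private" helper, mirroring the naming style above
        for name in names:
            self.greetings[name] = f"Hello, {name}!"

g = Greeter(["Ada", "Grace"])
print(g.greetings["Ada"])  # Hello, Ada!
```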

The __build_file_name__ Method

Due to my goal to become EXTREMELY effectively lazy, I've started to write more and more methods like this for my classes, so that I can do less low level thinking and work in the future. This private internal method expects that the base_dir and syn_mod_name attributes exist and builds the syn_mod_file_name attribute for later use. Such code is great to leverage from as you develop other classes.

def __build_file_name__(self):
    if self.base_dir[-1] != '/':
        self.base_dir += '/'
    self.syn_mod_file_name = f'{self.base_dir}{self.syn_mod_name}.syn'
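
As an aside, the same file name can be built a bit more portably with the standard library's os.path.join, which handles the trailing separator for us. This is a sketch of an alternative, not the repo's code.

```python
import os

def build_file_name(base_dir, syn_mod_name):
    # os.path.join inserts the separator only when it is missing,
    # so both '/tmp/syns' and '/tmp/syns/' produce the same result
    return os.path.join(base_dir, f'{syn_mod_name}.syn')

print(build_file_name('/tmp/syns', 'my_mod'))   # /tmp/syns/my_mod.syn
print(build_file_name('/tmp/syns/', 'my_mod'))  # /tmp/syns/my_mod.syn
```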
 

The __obtain_syns__ Method

If I had to give this private internal method a more specific type name, I would call it an internal private management method. If you've ever studied coding principles, this is my attempt to make my classes more cohesive. This method manages high level directions and operations. Due to the ESAI that I am building, I don't want to recreate memory modules for that expert system every time I run it. If I've created a module previously and stored it in a file as a JSON object, I simply load it and proceed. If I've deleted that module, or it has never existed because this is a new instantiation, I create it and store it. In either case, I return the syns_dict attribute for use. Note that if the synonym memory module does not yet exist, we call the __capture_syns__() method, which we will cover next. Also, please note the encoding='utf-8' parameters when dealing with python's open function. I really love utf-8. The only problem with it is that it's not always used by everyone, so I work to explicitly move all my files and strings to that format as I go along to avoid transition issues as much as possible. This effort saves MUCH future heartache!

def __obtain_syns__(self):
    module_exists = os.path.exists(self.syn_mod_file_name)

    if module_exists:
        with open(self.syn_mod_file_name, 'r', encoding='utf-8') as f:
            self.syns_dict = json.load(f)
        print('USING EXISTING MODULE')
        return self.syns_dict
    else:
        print('CREATING NEW MODULE')
        self.__capture_syns__()
        with open(self.syn_mod_file_name, 'w', encoding='utf-8') as f:
            json.dump(self.syns_dict, f, ensure_ascii=False, indent=4)
        print('NEW MODULE CREATION COMPLETE')
        return self.syns_dict
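
The load-or-create idea in this method is reusable on its own. Here is a miniature of the pattern, my own sketch rather than the repo's code, using a hypothetical load_or_create helper with a temp directory so it runs anywhere:

```python
import json
import os
import tempfile

def load_or_create(path, builder):
    """Load a JSON 'memory module' if it exists; otherwise build and store it."""
    if os.path.exists(path):
        with open(path, 'r', encoding='utf-8') as f:
            return json.load(f)
    data = builder()
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
    return data

path = os.path.join(tempfile.mkdtemp(), 'demo.syn')
first = load_or_create(path, lambda: {"barrier": {"obstacle": 100.0}})
second = load_or_create(path, lambda: {"never": "used"})  # loads from disk instead
print(second)  # {'barrier': {'obstacle': 100.0}}
```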
 

The __capture_syns__ Method

We’ve actually already covered the essence of this method above when we went over the code in Specific_Syns_2.py. Now, it’s in the operational structure of our new class. Let’s go over the updates:

  1. We’ve provided a list of words to match with synonyms when we instantiated the class, and the for loop looks for synonyms for all of them.
  2. We build a URL address for the current word to match to synonyms.
  3. We capture the html code from that page.
  4. If the status code returned by requests.get(URL) is not 200, the page for that word does not exist, and we move on to the next word.
  5. Else, if we did get content, we ask BeautifulSoup to parse the contents, find our specific <script> tag, clean up the JSON object to avoid JSON errors, and finally deserialize the JSON text into our data variable.
  6. Finally, we loop through that data dictionary and load it into a dictionary formatted to our specific needs. This, of course, is where those of you who use this module would change the reformatting code for your own specific needs.

def __capture_syns__(self):
    for word in self.word_list:
        URL = f'{Syns_Module_Builder.URL_BASE}{word}'
        page = requests.get(URL)
        if page.status_code != 200:
            continue
        else:
            soup = BeautifulSoup(page.content, 'html.parser')
            the_script = soup.find(
                'script',
                text=re.compile("window.INITIAL_STATE"))
            the_script = the_script.text
            the_script = the_script.replace("window.INITIAL_STATE = ", "")
            the_script = the_script.replace(':undefined', ':"undefined"')
            # strip only the trailing semicolon so semicolons in the data survive
            the_script = the_script.rstrip().rstrip(';')
            data = json.loads(the_script)

            self.syns_dict[word] = {}
            for a_tab in data["searchData"]["tunaApiData"]["posTabs"]:
                for syns in a_tab["synonyms"]:
                    sim = float(syns["similarity"])
                    syn = syns["term"]
                    self.syns_dict[word][syn] = sim
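
The cleanup steps can be checked in isolation. Below is a small demonstration using a made-up miniature of the script block's contents (not real Thesaurus.com data), showing the transformation from raw script text to a workable python dictionary:

```python
import json

# Made-up miniature of what the <script> tag's text looks like
the_script = 'window.INITIAL_STATE = {"word":"barrier","data":undefined};'

the_script = the_script.replace("window.INITIAL_STATE = ", "")
the_script = the_script.replace(':undefined', ':"undefined"')
the_script = the_script.rstrip().rstrip(';')  # drop only the trailing semicolon

data = json.loads(the_script)
print(data)  # {'word': 'barrier', 'data': 'undefined'}
```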
 

Closing

One of the many reasons that I started this blog was to pay it forward. Who could possibly repay all the great minds that have come before us for the great things they've given us in the form of wisdom, math, science, coding, etc.? When I see a hole in a space of knowledge and tools, I feel it's my turn to put something nice in that hole for those who will come after me. I HOPE that this post will help save a lot of time and effort for those who have come to the same point of need. Once I finish this series of posts on how to

align phrases found in technical text documents and
associate those phrases with standardized phrase descriptions,

I will be eager to get back to my more typical math concepts to complete code type blog posts, which are also a big passion of mine. Until next time …


Thom Ives

Data Scientist, PhD multi-physics engineer, and python loving geek living in the United States.