Modeling “MAD”ness: Logging Python Machine Learning Model Runs

Published by Thom Ives on

Model and Dependencies Automatic Logging Routine … 

Find it on GitHub

Watch for a link to the sister YouTube Video on the Integrated Machine Learning & AI Channel

Competing in data science and machine learning competitions is great because it pushes you to the next level. However, trying to keep track of all the models that you’ve tried can be tedious.

In the rush to try new things, I don’t always remember to carefully log which model settings go with which results. It’s hard to know where to go when you don’t have good records of where you’ve been.

I often write code to reduce my human errors, so I brought that habit into this realm by writing an automatic model logger. Just in case other data scientists also struggle with this, I wanted to share this tool. The link to MAD on my github repo is …

MAD – “Model and Dependencies” – capture machine learning settings and dependencies

I hope you will clone or download the repo right away and start experimenting with it before going through the rest of this blog, but you should be able to follow along fine if you can’t do that right now.

Here’s what the MAD class does:

  1. Create a model_logs directory if it does not exist, and save log files with a time stamp (or not) to that directory using a “changeable” base name.
  2. Document the machine learning model name and its parameters.
  3. Document the version of python in use (option to not do so).
  4. Document the pip requirements (option to not do so).
  5. Document the necessary imports (option to not do so).
  6. Document any other important notes not covered above, such as how missing values were filled, etc. (option to not do so).

In the GitHub repo, you will find:

  1. ToolKit.py – houses the MAD class and its methods, and is painfully documented and pep8’d to death!
  2. MAD_Test.py – a simple machine learning file that does nothing but instantiate a model from an sklearn machine learning class and then create the magic log of everything you (in the future) or someone else will need to replicate your modeling.
  3. model_logs – a directory holding two complete log files as examples.

Let’s first go over the MAD module and class. One of the imports, import inspect, may seem a bit unusual. We’ll go over it in detail soon. The import of __main__ will also be discussed in some detail.

import os
import os.path
import sys
import inspect
import __main__
from datetime import datetime


class MAD:
    """ "Model and Dependencies" - capture machine learning model 
            settings and dependencies:

        0) Create a model_logs directory if it does not exist, and 
            save log files with a time stamp (or not) to that directory.
        1) Document the machine learning model name 
            and its non-default parameters.
        2) Document the version of python in use.
        3) Document the pip requirements.
        4) Document the necessary imports.
        5) Document any other important notes not covered above, 
            such as how missing values were filled, etc.
    """


Let’s next go over the __init__ method for the MAD class.

    def __init__(self, mod, 
        file_name='model_data.txt', extra_notes='',
        add_time_stamp=True, **WTL):
        """Perform setups for auto documentation.

        The first line captures the filename of the script
            instantiating this class.
        The second line captures the locals from the same script
            in the first step.
        The if block adds a time stamp to the default, or provided, file_name.
        The next if block adds a model logs directory IF it does not yet exist.
        The file open to write context manager then creates a
            time stamped file, if add_time_stamp is true, and adds
            a documentation section captured from each method.

        Arguments:
            mod {class instance} --  instance of machine learning class

        Keyword Arguments:
            file_name {str} -- the filename used for the log file
                (default: {'model_data.txt'})
            extra_notes {str} -- optional notes providing more detail
                (default: {''})
            add_time_stamp {bool} -- bool to add a time stamp to the model file
                or not (default: {True})
            WTL {kwargs} -- What To Log (WTL) are key word arguments
                for what to log. The default is to log everything.
                Pass any py_version=False, pip_requirements=False,
                imports=False, and/or capture_notes=False to NOT log one
                or more of these.
        """
        # Section A.1
        self.calling_file = __main__.__file__
        self.locals = inspect.currentframe().f_back.f_locals
        if add_time_stamp:
            time_date = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
            file_name = file_name.replace(
                file_name[-4:], '_' + time_date + file_name[-4:])

        # Section A.2
        if not os.path.exists('model_logs'):
            os.makedirs('model_logs')

        # Section A.3
        with open('./model_logs/' + file_name, 'w') as self.out_file:
            self.out_file.write(self.get_model_info(mod))
            if (('py_version' not in WTL) or (WTL['py_version'] is True)):
                self.out_file.write(self.get_python_version())
            if (('pip_requirements' not in WTL) or
                    (WTL['pip_requirements'] is True)):
                self.out_file.write(self.get_pip_requirements())
            if (('imports' not in WTL) or (WTL['imports'] is True)):
                self.out_file.write(self.get_necessary_imports())
            if (('capture_notes' not in WTL) or
                    (WTL['capture_notes'] is True)):
                self.out_file.write(
                    '\n' + '# Extra Notes:\n' + extra_notes)


Take note of the self.calling_file assignment in Section A.1. This line allows us to get the name of the file that will instantiate the MAD class, so that we can use it later to get information from it. 

self.calling_file = __main__.__file__

Also take note of the self.locals assignment in Section A.1. This line of code allows us to get local variables from the file that is instantiating our MAD class. Both of these class variables will be used in later methods.

self.locals = inspect.currentframe().f_back.f_locals
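The mechanics here can be demonstrated in isolation. In this small standalone sketch (not part of MAD), a helper function reaches one frame back to snapshot its caller’s locals:

```python
import inspect

def capture_caller_locals():
    # f_back is the caller's stack frame; f_locals is its local namespace
    return dict(inspect.currentframe().f_back.f_locals)

def some_script():
    learning_rate = 0.01  # a local variable in the "calling" scope
    return capture_caller_locals()

print(some_script()['learning_rate'])  # prints 0.01
```

This is exactly the trick MAD relies on: at __init__ time, the frame one step back belongs to the script doing the instantiating.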

The if block in Section A.1 simply adds a time stamp to the file name base unless you override the default to bypass this feature. The file name base will be: a) the default; or b) one of your choosing.

Section A.2 adds a model_logs directory if one doesn’t yet exist. Finally, Section A.3 writes the various pieces of information to the log file unless you override them with the WTL keyword arguments that you pass in. Passing in:

  1. py_version=False,
  2. pip_requirements=False,
  3. imports=False, and/or
  4. capture_notes=False

will prevent one or more of these from being logged.
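The “log everything unless told otherwise” gating can be sketched on its own. This standalone function (hypothetical, not part of MAD) mirrors the checks in Section A.3:

```python
def sections_to_log(**WTL):
    """Return the section names that would be logged: a section is
    included unless the caller explicitly passed section=False."""
    sections = ['py_version', 'pip_requirements', 'imports', 'capture_notes']
    return [s for s in sections if WTL.get(s, True)]

print(sections_to_log())                        # all four sections
print(sections_to_log(pip_requirements=False))  # drops pip_requirements
```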

Now let’s cover the next three internal methods (named as private by convention, though Python does not enforce this).

    def _get_model_string(self, model):
        """Internal function that returns a string of the full model call

        Arguments:
            :param model: the instance of the model class being used
        Returns:
            a string of the full model call
        """
        model_string = str(model).replace('\n', '').replace(' ', '')

        return model_string

    def _get_model_name(self, model_string):
        """Internal function to get the model name
        Arguments:
            :param model_string: a string of the instance and model parameters
        Returns:
            {str} the model name
        """
        model_name = model_string.split('(')[0]

        return model_name

    def _get_model_params_array(self, model_string):
        """docstring here
        Arguments:
            :param model_string: the string of the instance call
                of the model class
        Returns:
            {list} a list of the model instance parameters
        """
        model_name = self._get_model_name(model_string)
        model_params_array = model_string.replace(
            model_name, '').replace('(', '').replace(')', '').split(',')

        return model_params_array


The sklearn libraries are great about allowing you to print model instances of its classes and turn them into strings. For our tool here, this allows us to slice and dice the information in many helpful ways. That’s all that’s going on in these methods to help us get model information. 
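The slicing can be seen without sklearn installed. Here is a minimal sketch using a hand-written repr-style string (hypothetical model and values) and the same operations as the three internal methods above:

```python
# A repr-style model string like the ones sklearn produces
model_string = "Ridge(alpha=0.5, solver='svd')".replace('\n', '').replace(' ', '')

# _get_model_name: everything before the first '('
model_name = model_string.split('(')[0]

# _get_model_params_array: strip the name and parentheses, split on commas
params = model_string.replace(model_name, '').replace(
    '(', '').replace(')', '').split(',')

print(model_name)  # Ridge
print(params)      # ['alpha=0.5', "solver='svd'"]
```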

Now we get to the first meaty method – the get_model_info method.

    def get_model_info(self, mod):
        """Simple method to return a formatted string of model information.
        Arguments:
            :param mod: {class instance} -- argument from __init__ method
        Returns:
            {str} the model with non default parameters in use
        """
        # Section B.1: get the model string and parameters
        model_string = self._get_model_string(mod)
        if 'Pipeline' in model_string:
            model_string = model_string.replace(',', ',\n\t\t')
            return '# Model and parameters:\n\t' + model_string + '\n'

        model_name   = self._get_model_name(model_string)
        model_params = self._get_model_params_array(model_string)

        # Section B.2a: get the params used in the default instance
        #     of the model
        default_imports_array = [
            x for x in self._get_imports_array() if model_name in x]
        default_imports_string = "\n".join(default_imports_array)
        default_exec_command = default_imports_string + ";" + "mod_default=" \
            + model_name + "()"
        exec(default_exec_command, globals(), locals())
        # Section B.2b: get the default model string and parameters
        default_model_string = self._get_model_string(locals()['mod_default'])
        default_model_params = self._get_model_params_array(
            default_model_string)

        # Section B.3: get the list of non default parameters and create
        #     a model string with non default parameters
        non_default_model_params = [
            x for x in model_params if x not in default_model_params]
        if len(non_default_model_params) == 0:
            log_model_string = model_name + "()"    
        else:
            log_model_string = model_name \
                + "(\n\t\t" + "\n\t\t".join(non_default_model_params) + ")"

        return '# Model and parameters:\n\t' + log_model_string + '\n'


Section B.1 uses the previous three internal methods to get information on the model that we wish to log. Note the short circuit when a Pipeline based model is encountered; I’ll talk about this at the end of the post. Sections B.2a and B.2b capture the model’s default parameters so that we can extract the “non-default parameters” for final reporting in Section B.3. The return statement then does a bit of formatting to our final string about the model before it’s returned for writing to the log.
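The heart of Section B.3 is a simple difference on parameter strings. A standalone sketch with hypothetical parameter arrays (the shape _get_model_params_array returns):

```python
# Hypothetical parameter arrays for an instance and its default counterpart
model_params   = ['alpha=0.5', 'fit_intercept=True', 'solver=svd']
default_params = ['alpha=1.0', 'fit_intercept=True', 'solver=auto']

# Anything not present verbatim in the defaults was set by the user
non_default = [p for p in model_params if p not in default_params]
print(non_default)  # ['alpha=0.5', 'solver=svd']
```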

The get_python_version method creates a string for logging the python version.

    def get_python_version(self):
        """Simple method to return a string of python version information.
        """
        return '\n# python version:\n' + str(sys.version.split('\n')[0] + '\n')

The next method, get_pip_requirements, is where things get fun! This was a bit of a trick to figure out, because we don’t want all the pip installs. We only want the pip installs needed by the file that is instantiating our MAD class and using a model.

    def get_pip_requirements(self):
        """Class method to capture pip requirements for the script.

        The first code block creates a string out of the local items.
        The second block does a pip version flexible import of freeze
            to capture modules loaded by pip.
        The third block obtains all pip installs that appear
            in local items also and puts them in a data frame.
        The fourth block formats a requirements.txt style string of required
            pip modules needed by the script instantiating this class.

        Returns:
            {str} -- a formatted list of modules needing pip installation
                     for the model to work.
        """
        # Section C.1
        local_modules_string = str(self.locals.items())

        # Section C.2
        try:
            from pip._internal.operations import freeze
        except ImportError:  # pip < 10.0
            from pip.operations import freeze

        # Section C.3
        x = freeze.freeze()
        pip_list = []
        for p in x:
            line = p.split('==')
            if line[0] in local_modules_string:
                pip_list.append(line)

        # Section C.4
        pip_rqmts = '\n# pip requirements:\n'
        for row in pip_list:
            line_string = row[0] + '==' + row[1] + '\n'
            pip_rqmts += line_string

        return pip_rqmts


Section C.1 converts a report of all local items to a string for use in Section C.3. Then, in Section C.2, we import freeze (anticipating import differences across pip versions) to help collect, at the beginning of Section C.3, all pip installs that have been made to the current environment. We only append those modules to pip_list that also appear in our local items, because these are all we need to report for the model being used. Finally, in Section C.4, pip_list is converted to a string formatted the same way a pip requirements file would be, before it is returned to be logged.
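If you would rather not depend on pip’s internal freeze API, the same filtering can be sketched with the standard library’s importlib.metadata (Python 3.8+). This is an alternative to what MAD currently does, not the tool’s actual code; local_names stands in for the stringified locals of the calling script:

```python
from importlib.metadata import distributions

def pip_style_requirements(local_names):
    # Walk every installed distribution and keep only those whose
    # name also appears in the calling script's locals string.
    lines = []
    for dist in distributions():
        name = dist.metadata['Name']
        if name and name in local_names:
            lines.append(f"{name}=={dist.version}")
    return '\n# pip requirements:\n' + '\n'.join(sorted(lines))
```

Because it is stdlib-only, this version keeps working when pip reorganizes its internals.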

The next method, _get_imports_array, is another internal method that captures a list of the imports from the file running the model that we are logging using our MAD instance. It simply loads the file for reading to capture the import statements from the calling file.

    def _get_imports_array(self):
        """Returns an array of import statements from the calling file.
        """   
        with open(self.calling_file, 'r') as f:
            FLA = f.readlines()
            imprt_lines = [line.rstrip('\n') for line in FLA if (
                ('import ' in line) and
                ('#' != line[0]) and ('MAD' not in line))]

        return imprt_lines

The last method, get_necessary_imports, which uses _get_imports_array, is fairly simple: it takes the array captured by the previous method and formats it so that it can be copied directly into a new python script if necessary.

    def get_necessary_imports(self):
        """Returns a formatted string of imports needed by the script
            instantiating this class.

        The first code block reads the file instantiating this class, and
            filters a list of the file lines to find those containing
            import statements.
        The second code block prepares a formatted string for return of
            the imports captured in the first block.

        Returns:
            {str} -- a formatted list of imports from the script of
                        code block one.
        """
        # Section D.1
        imprt_lines = self._get_imports_array()

        # Section D.2
        imports_string = '\n# Necessary Imports:\n'
        for line in imprt_lines:
            imports_string += line + '\n'

        return imports_string

That covers the entire MAD class. Now, let’s instantiate it and use it in a simple script that performs machine learning with a single model instantiation. The file below, MAD_Test.py, is also in the repo.

from sklearn.linear_model import LinearRegression
from ToolKit import MAD

###############################################################################
# ## Model info section

model = LinearRegression(normalize=True, copy_X=False)

Notes = """Logging the model with TWO non-default parameters
    and logging all aspects this time."""

MAD(model, extra_notes=Notes) 
# , py_version=False, pip_requirements=False, imports=False)
# , capture_notes=False)

Run this script with some variations to the LinearRegression instantiation and changes to the notes and to what you capture in the log to see how it works. The model_logs directory in the git repo has some logs that demonstrate my experiments with changing what I would ask it to report and with parameters deviating from default parameters. 

Plans for Future Work

So far, I’ve assumed that I will be able to tell which data files correspond to which log files. I am considering ways to automate this correspondence. I also plan to add a tool that will pre-pend the log file names with the results / scores that would be associated with them.

As of right now, the capture of pip requirements is robust, but the method of capturing imports is “tenuous” at best, and, honestly, I don’t like it. I’d almost like to dump it. It makes moving on from this work a bit inflexible.

One area that needs much more work is the way this handles Pipeline based models. I’m trying to develop an elegant way to deal with them so that it will only report “non-default” parameters, but this is not there yet. However, I’d rather use this tool in its current state than not use it at all, even with it logging all parameters including defaults. I’ll update the github repo with any improvements and note them in this blog post.


Thom Ives

Data Scientist, PhD multi-physics engineer, and python loving geek living in the United States.