Modeling “MAD”ness: Logging Python Machine Learning Model Runs
Model and Dependencies Automatic Logging Routine …
Watch for a link to the sister YouTube Video on the Integrated Machine Learning & AI Channel
Competing in data science / machine learning competitions is great because they push you to the next level. However, trying to keep track of all the models you’ve tried can be tedious.
In the rush to try new things, I don’t always remember to carefully log which model settings go with which results. It’s hard to know where to go when you don’t have good records of where you’ve been.
I often write code to reduce my human errors, so I brought that habit into this realm by writing an automatic model logger. Just in case other data scientists also struggle with this, I wanted to share this tool. The link to MAD on my github repo is …
MAD – “Model and Dependencies” – capture machine learning settings and dependencies
I hope you will clone or download the repo right away and start experimenting with it before going through the rest of this blog, but you should be able to follow along fine if you can’t do that right now.
Here’s what the MAD class does:
- Create a model_logs directory if it does not exist, and save log files with a time stamp (or not) to that directory using a “changeable” base name.
- Document the machine learning model name and its parameters.
- Document the version of python in use (option to not do so).
- Document the pip requirements (option to not do so).
- Document the necessary imports (option to not do so).
- Document any other important notes not covered above, such as how missing values were filled, etc. (option to not do so).
From the GitHub repo, you will find:
- ToolKit.py – houses the MAD class and its methods and is painfully documented and pep8’d to death!
- MAD_Test.py – a simple machine learning file that does nothing but instantiate a model from an sklearn machine learning class and then create the magic log of everything you (in the future) or someone else needs to know to replicate your modeling.
- model_logs – a directory holding two complete log files as examples.
Let’s first go over the MAD module and class. One of the imports, inspect, may seem a bit unusual; we’ll go over it in detail soon. The import of __main__ will also be discussed in some detail.
```python
import os
import os.path
import sys
import inspect
import __main__
from datetime import datetime


class MAD:
    """
    "Model and Dependencies" - capture machine learning model settings
    and dependencies:
    0) Create a model_logs directory if it does not exist, and save log
       files with a time stamp (or not) to that directory.
    1) Document the machine learning model name and its non-default
       parameters.
    2) Document the version of python in use.
    3) Document the pip requirements.
    4) Document the necessary imports.
    5) Document any other important notes not covered above, such as how
       missing values were filled, etc.
    """

    def __init__(self, mod, file_name='model_data.txt', extra_notes='',
                 add_time_stamp=True, **WTL):
        """Perform setups for auto documentation.
```
Let’s next go over the __init__ method for the MAD class.
```python
    def __init__(self, mod, file_name='model_data.txt', extra_notes='',
                 add_time_stamp=True, **WTL):
        """Perform setups for auto documentation.

        The first line captures the filename of the script instantiating
        this class. The second line captures the locals from the same
        script in the first step. The if block adds a time stamp to the
        default, or provided, file_name. The next if block adds a model
        logs directory IF it does not yet exist. The file open to write
        context manager then creates a time stamped file, if
        add_time_stamp is true, and adds a documentation section captured
        from each method.

        Arguments:
            mod {class instance} -- instance of machine learning class

        Keyword Arguments:
            file_name {str} -- the filename used for the log file
                (default: {'model_data.txt'})
            extra_notes {str} -- optional notes providing more detail
                (default: {''})
            add_time_stamp {bool} -- bool to add time stamp to model file
                or not (default: {True})
            WTL {kwargs} -- What To Log (WTL) are keyword arguments for
                what to log. The default is to log everything. Pass any of
                py_version=False, pip_requirements=False, imports=False,
                and/or capture_notes=False to NOT log one or more of
                these.
```
""" # Section A.1 self.calling_file = __main__.__file__ self.locals = inspect.currentframe().f_back.f_locals if add_time_stamp == True: time_date = datetime.now().strftime("%Y-%m-%d_%H:%M:%S") file_name = file_name.replace( file_name[-4:], '_' + time_date + file_name[-4:]) # Section A.2 if not os.path.exists('model_logs'): os.makedirs('model_logs') # Section A.3 with open('./model_logs/' + file_name, 'w') as self.out_file: self.out_file.write(self.get_model_info(mod)) if (('py_version' not in WTL) or (WTL['py_version'] is True)): self.out_file.write(self.get_python_version()) if (('pip_requirements' not in WTL) or (WTL['pip_requirements'] is True)): self.out_file.write(self.get_pip_requirements()) if (('imports' not in WTL) or (WTL['imports'] is True)): self.out_file.write(self.get_necessary_imports()) if (('capture_notes' not in WTL) or (WTL['capture_notes'] is True)): self.out_file.write( '\n' + '# Extra Notes:\n' + extra_notes)
Take note of the self.calling_file assignment in Section A.1. This line allows us to get the name of the file that will instantiate the MAD class, so that we can use it later to get information from it.
self.calling_file = __main__.__file__
Also take note of the self.locals assignment also in Section A.1. This line of code allows us to get local variables from the file that is instantiating our MAD class. Both of these class variables will be used in later methods.
self.locals = inspect.currentframe().f_back.f_locals
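If that inspect call looks like magic, a minimal standalone sketch shows what it does: one frame up (f_back) from the currently executing function is its caller, and f_locals is that frame's local variable dictionary. The helper and variable names below are hypothetical, used only for illustration:

```python
import inspect


def capture_caller_locals():
    # currentframe() is this function's own frame; f_back is the frame
    # of whoever called it; f_locals is that caller's local variables.
    return inspect.currentframe().f_back.f_locals


def caller():
    secret = 42
    return capture_caller_locals()


snapshot = caller()
print(snapshot['secret'])  # the callee can read the caller's locals: 42
```

MAD uses the same trick at __init__ time to grab the locals of the script that instantiated it.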
The if block in Section A.1 simply adds a time stamp to the file name base unless you override the default to bypass this feature. The file name base will be: a) the default; or b) one of your choosing.
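The time stamp trick is a simple replace on the last four characters of the file name (the extension); here is a standalone sketch, assuming a four-character extension such as .txt:

```python
from datetime import datetime

file_name = 'model_data.txt'
time_date = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
# Swap the 4-character extension for '_<stamp><extension>'
stamped = file_name.replace(file_name[-4:], '_' + time_date + file_name[-4:])
print(stamped)  # e.g. model_data_2020-01-31_12:00:00.txt
```

One caveat worth noting: the %H:%M:%S format puts colons into the file name, which is fine on Linux and macOS but is not allowed by Windows filesystems.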
Section A.2 adds a model_logs directory if one doesn’t yet exist. And finally, Section A.3 writes the various pieces of information to the log file unless you override them with the WTL keyword arguments that you send in. Passing in:
- py_version=False,
- pip_requirements=False,
- imports=False, and/or
- capture_notes=False
will skip logging one or more of these.
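The skip-or-log check reduces to one boolean expression: log the section unless its flag was explicitly passed as False. Here is a minimal sketch with a hypothetical should_log helper (not part of MAD itself):

```python
def should_log(section, **WTL):
    # Log by default; skip only when the caller passes section=False.
    return (section not in WTL) or (WTL[section] is True)


print(should_log('py_version'))                    # True  (default: log)
print(should_log('py_version', py_version=False))  # False (explicitly off)
print(should_log('imports', imports=True))         # True  (explicitly on)
```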
Now let’s cover the next three internal methods (meant to be private, but they don’t need to be).
```python
    def _get_model_string(self, model):
        """Internal function that returns a string of the full model call.

        Arguments:
        :param model: the instance of the model class being used

        Returns: a string of the full model call
        """
        model_string = str(model).replace('\n', '').replace(' ', '')
        return model_string

    def _get_model_name(self, model_string):
        """Internal function to get the model name.

        Arguments:
        :param model_string: a string of the instance and model parameters

        Returns: {str} the model name
        """
        model_name = model_string.split('(')[0]
        return model_name

    def _get_model_params_array(self, model_string):
        """Internal function to get the model instance parameters.

        Arguments:
        :param model_string: the string of the instance call of the
            model class

        Returns: {list} a list of the model instance parameters
        """
        model_name = self._get_model_name(model_string)
        model_params_array = model_string.replace(
            model_name, '').replace('(', '').replace(')', '').split(',')
        return model_params_array
```
The sklearn libraries are great about letting you print model instances of their classes and turn them into strings. For our tool here, this allows us to slice and dice the information in many helpful ways. That’s all that’s going on in these methods to help us get model information.
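To see the slicing in isolation, here is a minimal sketch run on a repr string written in the style sklearn produces (the string is hand-made for illustration, not live sklearn output):

```python
# A repr string in the style sklearn produces for an estimator
model_repr = "LinearRegression(copy_X=False,\n                 normalize=True)"

# Same steps as _get_model_string / _get_model_name / _get_model_params_array
model_string = model_repr.replace('\n', '').replace(' ', '')
model_name = model_string.split('(')[0]
params = model_string.replace(model_name, '').replace(
    '(', '').replace(')', '').split(',')

print(model_name)  # LinearRegression
print(params)      # ['copy_X=False', 'normalize=True']
```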
Now we get to the first meaty method – the get_model_info method.
```python
    def get_model_info(self, mod):
        """Simple method to return a formatted string of model information.

        Arguments:
        :param mod: {class instance} -- argument from __init__ method

        Returns: {str} the model with non-default parameters in use
        """
        # Section B.1: get the model string and parameters
        model_string = self._get_model_string(mod)
        if 'Pipeline' in model_string:
            model_string = model_string.replace(',', ',\n\t\t')
            return '# Model and parameters:\n\t' + model_string + '\n'
        model_name = self._get_model_name(model_string)
        model_params = self._get_model_params_array(model_string)

        # Section B.2a: get the params used in the default instance
        # of the model
        default_imports_array = [
            x for x in self._get_imports_array() if model_name in x]
        default_imports_string = "\n".join(default_imports_array)
        default_exec_command = default_imports_string + ";" + "mod_default=" \
            + model_name + "()"
        exec(default_exec_command, globals(), locals())

        # Section B.2b: get the default model string and parameters
        default_model_string = self._get_model_string(locals()['mod_default'])
        default_model_params = self._get_model_params_array(
            default_model_string)

        # Section B.3: get the list of non default parameters and create
        # a model string with non default parameters
        non_default_model_params = [
            x for x in model_params if x not in default_model_params]
        if len(non_default_model_params) == 0:
            log_model_string = model_name + "()"
        else:
            log_model_string = model_name \
                + "(\n\t\t" + "\n\t\t".join(non_default_model_params) + ")"
        return '# Model and parameters:\n\t' + log_model_string + '\n'
```
Section B.1 uses the previous three internal methods to get information on the model that we wish to log. Note the short circuit of work when a Pipeline based model is encountered; I’ll talk about this at the end of the post. Sections B.2a and B.2b capture the model’s default parameters, so that we can get the “non default parameters” for final reporting in Section B.3. Then the return statement does a bit of formatting to our final string about the model before it’s returned for writing to the log.
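The non-default filter in Section B.3 is just a list comprehension over two parameter arrays; here it is in isolation, with hand-made parameter lists standing in for the real ones:

```python
# Parameters parsed from the user's instance vs. from a default instance
model_params = ['copy_X=False', 'fit_intercept=True', 'normalize=True']
default_model_params = ['copy_X=True', 'fit_intercept=True', 'normalize=False']

# Anything not present in the default list must have been set by the user
non_default = [p for p in model_params if p not in default_model_params]
print(non_default)  # ['copy_X=False', 'normalize=True']
```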
The get_python_version method creates a string for logging the python version.
```python
    def get_python_version(self):
        """Simple method to return a string of python version information."""
        return '\n# python version:\n' + str(sys.version.split('\n')[0] + '\n')
```
The next method, get_pip_requirements, is where things get fun! This was a bit of a trick to figure out, because we don’t want all the pip installs. We only want the pip installs that are needed by the file that is instantiating our MAD class and using a model.
```python
    def get_pip_requirements(self):
        """Class method to capture pip requirements for the script.

        The first code block creates a string out of the local items. The
        second block does a pip version flexible import of freeze to
        capture modules loaded by pip. The third block obtains all pip
        installs that also appear in local items and puts them in a list.
        The fourth block formats a requirements.txt style string of
        required pip modules needed by the script instantiating this
        class.

        Returns: {str} -- a formatted list of modules needing pip
            installation for the model to work.
        """
        # Section C.1
        local_modules_string = str(self.locals.items())

        # Section C.2
        try:
            from pip._internal.operations import freeze
        except ImportError:  # pip < 10.0
            from pip.operations import freeze

        # Section C.3
        x = freeze.freeze()
        pip_list = []
        for p in x:
            line = p.split('==')
            if line[0] in local_modules_string:
                pip_list.append(line)

        # Section C.4
        pip_rqmts = '\n# pip requirements:\n'
        for row in pip_list:
            line_string = row[0] + '==' + row[1] + '\n'
            pip_rqmts += line_string
        return pip_rqmts
```
Section C.1 converts a report of all local items to a string for use in Section C.3. Then, in Section C.2, we import freeze (anticipating potential import issues across pip versions) to help collect, at the beginning of Section C.3, all pip installs that have been made to the current environment. We only append those modules to pip_list that also appear in our local items, because that is all we need to report for the model being used. Finally, in Section C.4, pip_list is converted to a nice string formatted the same way a pip requirements file would be formatted before returning it to be logged.
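The matching logic in Sections C.3 and C.4 can be sketched on canned data; the freeze lines, versions, and locals string below are hypothetical stand-ins for real freeze.freeze() output and real script locals:

```python
# A few lines in the style `pip freeze` emits (hypothetical versions)
freeze_lines = ['numpy==1.18.1', 'pandas==1.0.1', 'scikit-learn==0.22.1']

# Stand-in for str(self.locals.items()) from the calling script:
# module reprs mention 'numpy' and 'pandas' but nothing scikit-learn-ish
local_modules_string = (
    "dict_items([('np', <module 'numpy' from ...>), "
    "('pd', <module 'pandas' from ...>)])")

# Section C.3 logic: keep only freeze lines whose package name appears
# somewhere in the locals string
pip_list = [p.split('==') for p in freeze_lines
            if p.split('==')[0] in local_modules_string]

# Section C.4 logic: rebuild a requirements.txt-style string
pip_rqmts = '\n# pip requirements:\n'
for name, version in pip_list:
    pip_rqmts += name + '==' + version + '\n'
print(pip_rqmts)
```

Note the substring match: a package whose pip name differs from its import name (scikit-learn vs. sklearn, for instance) slips through this filter, which is part of why I call the dependency capture imperfect later in the post.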
The next method, _get_imports_array, is another internal method that captures a list of the imports from the file running the model we are logging with our MAD instance. It simply opens the calling file for reading and collects its import statements.
```python
    def _get_imports_array(self):
        """Returns an array of import statements from the calling file."""
        with open(self.calling_file, 'r') as f:
            FLA = f.readlines()
        imprt_lines = [line.rstrip('\n') for line in FLA if (
            ('import ' in line) and ('#' != line[0]) and ('MAD' not in line))]
        return imprt_lines
```
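You can re-create this filter on a throwaway script to see exactly what survives it; the sample script content below is hypothetical:

```python
import os
import tempfile

# A hypothetical calling script: a comment, two real imports,
# the MAD import, and a plain statement
script = (
    "# A sample calling script\n"
    "import numpy as np\n"
    "from sklearn.linear_model import LinearRegression\n"
    "from ToolKit import MAD\n"
    "x = 1\n"
)

with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
    f.write(script)
    path = f.name

with open(path, 'r') as f:
    lines = f.readlines()
os.unlink(path)

# Same filter as _get_imports_array: keep lines containing 'import ',
# skip lines starting with '#' and the MAD import itself
imports = [line.rstrip('\n') for line in lines if (
    ('import ' in line) and ('#' != line[0]) and ('MAD' not in line))]
print(imports)
# ['import numpy as np', 'from sklearn.linear_model import LinearRegression']
```

The filter is string-based, which is why I call it tenuous in the closing section: a commented-out import indented by one space, or the word "import" inside a string, would sneak through.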
The last method, get_necessary_imports, which uses _get_imports_array, is fairly simple: it takes the array captured by the previous method and formats it so that it can be copied directly into a new python script if necessary.
```python
    def get_necessary_imports(self):
        """Returns a formatted string of imports needed by the script
        instantiating this class.

        The first code block reads the file instantiating this class, and
        filters a list of the file lines to find those containing import
        statements. The second code block prepares a formatted string for
        return of the imports captured in the first block.

        Returns: {str} -- a formatted list of imports from the script of
            code block one.
        """
        # Section D.1
        imprt_lines = self._get_imports_array()

        # Section D.2
        imports_string = '\n# Necessary Imports:\n'
        for line in imprt_lines:
            imports_string += line + '\n'
        return imports_string
```
That covers the entire MAD class. Now, let’s instantiate it and use it in a simple script that performs a bare-bones model instantiation. The file below, MAD_Test.py, is also in the repo.
```python
from sklearn.linear_model import LinearRegression

from ToolKit import MAD

###############################################################################
# ## Model info section
model = LinearRegression(normalize=True, copy_X=False)

Notes = """Logging the model with TWO non-default parameters and logging
all aspects. Removed the py_version, pip_requirements and imports logging
this time."""

MAD(model, extra_notes=Notes)
# , py_version=False, pip_requirements=False, imports=False)
# , capture_notes=False)
```
Run this script with some variations to the LinearRegression instantiation and changes to the notes and to what you capture in the log to see how it works. The model_logs directory in the git repo has some logs that demonstrate my experiments with changing what I would ask it to report and with parameters deviating from default parameters.
Plans for Future Work
So far, I’ve assumed that I will be able to tell which data files correspond to which log files. I am considering ways to automate this correspondence. I also plan to add a tool that will pre-pend the log file names with the results / scores that would be associated with them.
As of right now, the capture of pip requirements is robust, but the method of capturing imports is “tenuous” at best, and, honestly, I don’t like it. I’d almost like to dump it. It makes moving on from this work a bit inflexible.
One area that needs much more work is the way that this handles Pipeline based models. I’m trying to develop an elegant way to deal with this so that it will only report “non-default” parameters, but this is not there yet. However, I’d rather use this tool in its current state than not use it at all, even with it logging all parameters including defaults. I’ll update the github repo with any improvements and note them in this blog post.