Logging & Errors in GSSF

Logging tools

Statistics monitoring

GSSF has prototype integration with vigilant, a tool that maintains a datastore monitoring all instances of (in our case) go-smart-launcher. It can do so across multiple machines and provides a web-based interface for viewing the results. Note that this is entirely distinct from GSSA, which orchestrates workflows such as GSSF and is unaware of vigilant monitoring.

Note that vigilant changed its name from observant during the Go-Smart project. The old name, observant, is still used in much of the code and will be replaced as the dependencies are updated. However, as the project is now called vigilant and appears likely to keep that name, the comments generally use vigilant to refer to it.

Monitoring can be configured using the YAML file {INSTALLROOT}/etc/gosmart/vigilant.cfg. Configuration options are as described in the vigilant documentation.

TODO: Update configuration format to JSON, following upstream change.

FIXME: Posting vigilant messages on log_line is currently suspended, as the filling pipe was slowing the rest of the master process. This should properly be moved to a separate thread; for the moment, to test this functionality, uncomment the marked line in logger_vigilant.py.
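One way the posting could be moved off the master process is a bounded queue drained by a worker thread. The following is only a sketch of that idea, not the code in logger_vigilant.py: the BackgroundPoster class and the post callable are hypothetical names, and dropping lines when the queue is full is an assumed policy.

```python
import queue
import threading


class BackgroundPoster:
    """Hand log lines to a worker thread so the caller never blocks on posting."""

    def __init__(self, post, maxsize=1000):
        # `post` is a hypothetical callable that sends one line to the monitor.
        self._post = post
        self._queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log_line(self, line):
        # Never block the master process: count and drop the line if the queue is full.
        try:
            self._queue.put_nowait(line)
        except queue.Full:
            self.dropped += 1

    def _run(self):
        while True:
            line = self._queue.get()
            if line is None:  # sentinel: stop the worker
                break
            try:
                self._post(line)
            except Exception:
                # A monitoring failure should not take down the workflow.
                pass

    def close(self):
        # Blocking put, so the sentinel is queued even if the queue is full.
        self._queue.put(None)
        self._thread.join()


# Usage: a plain list stands in for the monitoring service.
posted = []
poster = BackgroundPoster(posted.append)
for i in range(3):
    poster.log_line("line %d" % i)
poster.close()
print(posted)
```

The bounded queue means a slow monitor costs dropped messages rather than a stalled simulation, which seems the safer trade-off for optional monitoring.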

Logpick

The logpick functionality allows individual components of the GSSF workflow to monitor their child process output for certain regular expressions that indicate the start and end of some internal process. The GoSmartComponent class, on which all the components are based, will sum the time spent in this internal task (based on the child's output) and print the total when the subprocess exits. Each logpick entry is expressed as a triple in GoSmartComponent.logpick_pairs:

("START_PATTERN", "END_PATTERN", "LABEL")

For example, in the Elmer solver:

("CRS_IncompleteLU: ILU(0) (Real), Starting Factorization", "ComputeChange:", "Solver A")

gives, when the solver finally exits,

Timings (sec resolution):
 --    9 Solver A <'CRS_IncompleteLU: ILU(0) (Real), Starting Factorization' - 'ComputeChange:'>
 --  486 [other]
    ====   
 -- 496

Adding additional logpick entries will, naturally, help account for more of the time used.
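The bookkeeping behind this can be sketched as follows. This is an illustration of the idea only, not the GoSmartComponent implementation: the LogpickTimer class, its feed method and the injectable clock are hypothetical, and the patterns in the usage example are simplified so that they contain no regex metacharacters.

```python
import re
import time


class LogpickTimer:
    """Sum the time spent between START_PATTERN and END_PATTERN lines."""

    def __init__(self, logpick_pairs, clock=time.time):
        # Compile each (start, end, label) triple once.
        self._pairs = [
            (re.compile(start), re.compile(end), label)
            for start, end, label in logpick_pairs
        ]
        self._clock = clock
        self._started = {}  # label -> time at which the open interval began
        self.totals = {label: 0.0 for _, _, label in logpick_pairs}

    def feed(self, line):
        """Process one line of child output, timestamped on arrival."""
        now = self._clock()
        for start, end, label in self._pairs:
            if label in self._started:
                # Interval is open: close it if the end pattern appears.
                if end.search(line):
                    self.totals[label] += now - self._started.pop(label)
            elif start.search(line):
                self._started[label] = now


# Usage with a fake clock, so the result does not depend on wall time.
ticks = iter([0.0, 2.0, 9.0, 10.0])
timer = LogpickTimer(
    [("Starting Factorization", "ComputeChange:", "Solver A")],
    clock=lambda: next(ticks),
)
for line in [
    "Reading mesh",
    "CRS_IncompleteLU: Starting Factorization",
    "ComputeChange: done",
    "Finished",
]:
    timer.feed(line)
print(timer.totals)
```

Time that falls outside every open interval is what would be reported under [other].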

Rate limiting

Each component has a member suppress_logging_over_per_second, which may be set to the maximum number of log lines per second allowed from that component, or to None to disable rate limiting.
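A minimal sketch of how such a limit might behave is given below. Only the member name suppress_logging_over_per_second comes from GSSF; the RateLimitedLogger class, the emit callable and the fixed one-second window are assumptions made for illustration.

```python
import time


class RateLimitedLogger:
    """Pass through at most N lines per one-second window (None = unlimited)."""

    def __init__(self, emit, suppress_logging_over_per_second=None, clock=time.time):
        self._emit = emit  # hypothetical callable that writes one log line
        self.suppress_logging_over_per_second = suppress_logging_over_per_second
        self._clock = clock
        self._window = None  # whole second currently being counted
        self._count = 0
        self.suppressed = 0

    def log_line(self, line):
        limit = self.suppress_logging_over_per_second
        if limit is not None:
            window = int(self._clock())
            if window != self._window:
                # A new second has begun: restart the count.
                self._window, self._count = window, 0
            self._count += 1
            if self._count > limit:
                self.suppressed += 1
                return
        self._emit(line)


# Usage with a fake clock: three lines in second 0, one in second 1.
ticks = iter([0.1, 0.2, 0.3, 1.1])
out = []
logger = RateLimitedLogger(
    out.append,
    suppress_logging_over_per_second=2,
    clock=lambda: next(ticks),
)
for line in ["a", "b", "c", "d"]:
    logger.log_line(line)
print(out, logger.suppressed)
```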

Error handling

Exception classes

Very basic unified error types are provided, to help distinguish between user errors, programmer errors, modeller errors and errors of unknown responsibility.

These contain codes which can be matched by the client-side tools. In theory, all errors returned from GSSF will be one of these. Any errors thrown at a lower level will be caught and wrapped accordingly.

Error Code | Error ID | Interpretation | GSSF exception class (if applicable)
---------- | -------- | -------------- | ------------------------------------
SUCCESS | 0 | All worked | -
E_UNKNOWN | 1 | Error of unknown origin | GoSmartError
E_CLIENT | 2 | Triggered by an issue on the client side, such as illogical input | GoSmartClientError
E_SERVER | 3 | Problems with the server or server-side tools | GoSmartServerError
E_MODEL | 4 | Modelling problem, where the server cannot complete the task for physical/mathematical/numerical/syntactical reasons that are the responsibility of the model developer | GoSmartModelError

In general, we err on the side of caution and attribute anything uncertain to E_SERVER or E_UNKNOWN. In the future, however, being less conservative with E_MODEL may help provide automatic feedback on issues.

Ideally, all errors thrown in GSSF should be one of the classes listed above, found in gssf.errors. These are caught by go-smart-launcher, which writes an error file that GSSA can process and report accordingly. This also works properly when GSSF is used standalone.
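The catch-and-wrap pattern can be sketched as below. The class names and codes follow the table above, but everything else is an assumption and does not reproduce gssf.errors: the classes are local stand-ins, the inheritance from GoSmartError, the code and error_id attributes, and the wrap and run helpers are all hypothetical.

```python
class GoSmartError(Exception):
    """Stand-in for the base class; also used for errors of unknown origin."""
    code, error_id = "E_UNKNOWN", 1


class GoSmartClientError(GoSmartError):
    code, error_id = "E_CLIENT", 2


class GoSmartServerError(GoSmartError):
    code, error_id = "E_SERVER", 3


class GoSmartModelError(GoSmartError):
    code, error_id = "E_MODEL", 4


def wrap(exc):
    """Return exc if it is already a unified type, else wrap it as unknown."""
    if isinstance(exc, GoSmartError):
        return exc
    wrapped = GoSmartError(str(exc))
    wrapped.__cause__ = exc  # keep the original exception for debugging
    return wrapped


def run(task):
    """Run a task and return a record a client-side tool could match on."""
    try:
        task()
    except Exception as exc:
        err = wrap(exc)
        return {"id": err.error_id, "code": err.code, "message": str(err)}
    return {"id": 0, "code": "SUCCESS", "message": ""}


def bad_input():
    raise GoSmartClientError("needle length must be positive")


def unexpected():
    raise KeyError("mesh")


client_result = run(bad_input)
unknown_result = run(unexpected)
ok_result = run(lambda: None)
print(client_result, unknown_result, ok_result)
```

Wrapping anything unrecognised as E_UNKNOWN matches the cautious attribution policy described above.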

Error regexes

Each line of subprocess output is checked against a per-component error regex, and a matching line is stored as the subprocess error message. Since the subprocess may know nothing about GSSF, such an approach is required to ensure we can provide at least some feedback if it crashes.
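As an illustration, the check could look like the sketch below. The function name find_error_message and the example pattern are hypothetical, and keeping the most recent matching line (rather than the first) is an assumption.

```python
import re


def find_error_message(lines, error_regex):
    """Return the last line matching error_regex, or None if none matches."""
    pattern = re.compile(error_regex)
    message = None
    for line in lines:
        if pattern.search(line):
            message = line.strip()  # assumed policy: later matches overwrite earlier ones
    return message


# Usage with made-up subprocess output and a made-up per-component pattern.
output = [
    "Reading mesh",
    "ERROR:: LoadMesh: Unable to open file",
    "Exiting",
]
error_message = find_error_message(output, r"ERROR::")
no_error = find_error_message(["All done"], r"ERROR::")
print(error_message)
```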