Debugging DataWarehouse triager and issue regexes

How to investigate why a certain test was not triaged in DataWarehouse

Problem

A test was not tagged correctly or an issue regex is not working as expected.

Steps

  1. Install and configure cki-tools via

    git clone https://gitlab.com/cki-project/cki-tools
    cd cki-tools
    python3 -m pip install --user -e .[triager]
    export DATAWAREHOUSE_URL='https://datawarehouse.cki-project.org'
    export CKI_LOGGING_LEVEL=DEBUG
    

    Get a DataWarehouse token if you need to access internal tests or issues in the DataWarehouse, and export it via

    export DATAWAREHOUSE_TOKEN_TRIAGER='token or empty'
    
  2. Check whether a local run of the triager would correctly tag the test via

    $ python3 -m cki.triager.main single test redhat:394452540_s390x_upt_9
    2021-10-25T11:10:26.471 - [INFO] - cki.triager.checkers -  running: check_logs_with_regex
    2021-10-25T11:10:41.105 - [INFO] - cki.triager.checkers -   result: []
    2021-10-25T11:10:41.106 - [INFO] - cki.triager.checkers -  running: check_kickstart_error
    2021-10-25T11:10:41.106 - [INFO] - cki.triager.checkers -   result: []
    2021-10-25T11:10:41.106 - [INFO] - cki.triager.checkers -  overall result: []
    
  3. If the local run tags the test correctly, the problem is most likely related to the communication between the triager and DataWarehouse.

    Check the execution of the triager as a service and look for outstanding problems. If a test was not triaged, first confirm that the service processed it by searching the logs for the test ID.

    All logs from the DataWarehouse Triager are accessible using Applecrumble. Use the Explore feature to search the logs with a LogQL query like the following.

    {deployment="datawarehouse-triager"}
    

    Make sure you are logged in so that you can access the Explore page, and select the Loki data source.

    Narrow down the results by adding a line filter with details about what you are looking for, such as the test ID or issue name.

    {deployment="datawarehouse-triager"} |= "redhat:1234"
    {deployment="datawarehouse-triager"} |= "Storage blktests"
    

    Make sure to select a time span in the top right corner that covers the moment the test should have been processed.

    When an issue is identified, the log output should look similar to these lines:

    2021-10-22T21:32:22.132 - [INFO] - cki.triager.checkers -  running: check_logs_with_regex
    2021-10-22T21:32:24.060 - [INFO] - cki.triager.checkers -   result: [{'name': 'Storage blktests - srp: stuck on srp/005', 'id': 691}]
    2021-10-22T21:32:24.060 - [INFO] - cki.triager.checkers -  running: check_kickstart_error
    2021-10-22T21:32:24.061 - [INFO] - cki.triager.checkers -   result: []
    2021-10-22T21:32:24.061 - [INFO] - cki.triager.checkers -  overall result: [{'name': 'Storage blktests - srp: stuck on srp/005', 'id': 691}]
    2021-10-22T21:32:24.128 - [INFO] - cki.triager.triager - Linking issue id={'name': 'Storage blktests - srp: stuck on srp/005', 'id': 691} to id=redhat:133661518
    

    Because multiple pods run at the same time, these lines will probably be interleaved with lines from other runs.

  4. In case the test was processed but the failure was not detected, check whether the regex is correctly detecting the issue.

    The following Python script helps you validate the submitted regex against the file where the failure should be present. If your regex pattern requires flags like re.DOTALL, set them inline, as a prefix, e.g. (?s)single.*line.

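    import re
    
    import requests
    
    # Placeholders: URL of the log file and the issue regex under test.
    LOG_URL = 'https://url-to-log-file'
    REGEX = r'regex content'
    
    # Download the log; fail on HTTP errors instead of searching an error page.
    response = requests.get(LOG_URL, timeout=60)
    response.raise_for_status()
    log_content = response.content.decode(errors='ignore')
    
    # Prints a re.Match object if the regex is found, None otherwise.
    print(re.search(REGEX, log_content))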
    
  5. If the regex is correctly defined and the script above finds a match, the last step is to debug the triager execution itself.
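
As a minimal sketch of the inline-flag advice in step 4, the following compares the same pattern with and without the (?s) prefix. The log excerpt and the pattern are made up for illustration and do not come from a real test.

```python
import re

# Hypothetical log excerpt: the failure message spans two lines.
log_content = "srp/005 started\ntest timed out\n"

# Without DOTALL, '.' does not match a newline, so the pattern cannot cross lines.
without_flag = re.search(r'started.*timed out', log_content)

# The inline (?s) prefix enables re.DOTALL from inside the pattern string itself.
with_flag = re.search(r'(?s)started.*timed out', log_content)

print(without_flag)           # None
print(with_flag is not None)  # True
```

If a regex only matches once (?s) is added, the failure text most likely spans several lines of the log, and the flag needs to be part of the pattern string.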

Additional steps for manually tagging issues

  1. To enable tagging issues for local runs of the triager, add --no-dry-run, for example

    python3 -m cki.triager.main --no-dry-run single test redhat:394452540_s390x_upt_9