========
Executor
========

The Valve Infra executor is the service that coordinates different services to enable time-sharing test machines, AKA devices under test (DUTs).

You can interact with this service using `executorctl`_, our client, and/or our `REST API`_.

The executor manages the state of each DUT. The following flow chart shows the transitions between the different states.

.. _dut_state_machine:

.. mermaid::
    :align: center
    :alt: DUT state machine
    :caption: DUT state machine

    graph TD
      subgraph "DUT state machine"
        START --> |is retired?| RETIRED
        START --> |is marked ready for service?| QUICK_CHECK
        QUICK_CHECK --> |Success| IDLE
        QUICK_CHECK --> |Failed| TRAINING
        TRAINING --> |Failed| TRAINING
        TRAINING --> |Success| IDLE
        RETIRED --> |Activate| QUICK_CHECK
        IDLE --> |Retire| RETIRED
        IDLE --> |Job received| QUEUED
        QUEUED --> RUNNING
        RUNNING --> IDLE
      end

Here is what each DUT state means:

* ``IDLE``: The device is available (but powered down to save energy), waiting for a job.
* ``TRAINING``: The device is being tested for boot reliability (``SERGENT_HARTMAN_BOOT_COUNT`` rounds, `100` by default).
* ``RETIRED``: The device is undergoing maintenance, and cannot accept jobs.
* ``QUICK_CHECK``: The device is verifying that its current configuration matches what is described in the database.
* ``QUEUED``: The device has been chosen to execute a job, but the executor isn't ready just yet (expected to last <1s).
* ``RUNNING``: The device is running a job.

.. _config:

Configuration
=============

The executor service is configured through environment variables. Here are the options relevant to most deployments:

* ``BOOTS_TFTP_ROOT``: Cache folder for boot-related artifacts (default: `/cache/boots/tftp`)
* ``BOOTS_DEFAULT_*``: See `Default boot configuration`_ for more details.
* ``EXECUTOR_URL``: HTTP URL of the executor service, reachable locally and from the test machines (default: `http://ci-gateway`)
* ``EXECUTOR_ARTIFACT_CACHE_ROOT``: Folder to use as a cache for the kernel/initrd artifacts used by the jobs (**recommended**, default: None)
* ``FARM_NAME``: Name of the test farm (**mandatory**, default: None)
* ``GITLAB_CONF_FILE``: Path to the GitLab runner configuration file, which will be overwritten as new test machines are added to the farm (default: `/etc/gitlab-runner/config.toml`)
* ``GITLAB_CONF_TEMPLATE_FILE``: Template to use for the creation of the GitLab runner configuration file (default: `$package_dir/templates/gitlab_runner_config.toml.j2`)
* ``MARS_DB_FILE``: Path to the database (default: `/config/mars_db.yaml`)
* ``MINIO_URL``: URL to the local minio service, accessible both locally and by test machines (default: `http://ci-gateway:9000`)
* ``MINIO_ROOT_USER``: Admin username for the local minio service (default: `minioadmin`)
* ``MINIO_ROOT_PASSWORD``: Admin password for the local minio service (default: `minio-root-password`)
* ``PRIVATE_INTERFACE``: Network interface connected to the DUTs' network (default: `private`)
* ``SALAD_URL``: URL to the salad service (default: `http://ci-gateway:8005`)
* ``SERGENT_HARTMAN_BOOT_COUNT``: How many rounds of testing should be used to qualify a test machine (default: `100`)
* ``SERGENT_HARTMAN_QUALIFYING_BOOT_COUNT``: How many successful rounds of testing are needed to qualify a test machine (default: `100`)
* ``SERGENT_HARTMAN_REGISTRATION_RETRIAL_DELAY``: How many seconds to wait after an unsuccessful registration attempt before trying again (default: `120`)

And here are the lower-level options:

* ``BOOTS_DISABLE_SERVERS``: Set to a non-empty value to disable netbooting services (DHCP and TFTP).
  (default: None)
* ``BOOTS_DHCP_IPv4_SOCKET_NAME``: Name of the socket to use for the DHCP server, as set by systemd's socket activation unit using `FileDescriptorName=` (default: `dhcp_ipv4`)
* ``BOOTS_TFTP_IPv4_SOCKET_NAME``: Name of the socket to use for the TFTP server, as set by systemd's socket activation unit using `FileDescriptorName=` (default: `tftp_ipv4`)
* ``CONSOLE_PATTERN_DEFAULT_MACHINE_UNFIT_FOR_SERVICE_REGEX``: Automatically tag a DUT as unfit for service if it generates a line matched by this regular expression (default: None)
* ``EXECUTOR_HOST``: Binding address for the HTTP service (default: `0.0.0.0`)
* ``EXECUTOR_PORT``: Binding port for the HTTP service (default: `80`)
* ``EXECUTOR_HTTP_IPv4_SOCKET_NAME``: Name of the socket to use for the HTTP server, as set by systemd's socket activation unit using `FileDescriptorName=` (default: `http_ipv4`). Overrides ``EXECUTOR_HOST``/``EXECUTOR_PORT``.
* ``EXECUTOR_REGISTRATION_JOB``: Local path to the registration job (default: `$package_dir/job_templates/register.yml.j2`)
* ``EXECUTOR_BOOTLOOP_JOB``: Local path to the bootloop job (default: `$package_dir/job_templates/bootloop.yml.j2`)
* ``EXECUTOR_VPDU_ENDPOINT``: Automatically add a virtual PDU for local testing (format: `host:port`, default: None)
* ``MINIO_ADMIN_ALIAS``: Alias set up by the executor to refer to the minio instance specified by ``MINIO_URL``, ``MINIO_ROOT_USER``, and ``MINIO_ROOT_PASSWORD`` (default: `local`)

.. _default_boot_config:

Default boot configuration
--------------------------

When an unsolicited boot request is received by the executor (e.g. an admin added a new test machine), it needs to know which kernel/initrd/cmdline this test machine needs to run in order to complete its registration.
Here are the most relevant options:

* ``BOOTS_DEFAULT_KERNEL``: Default kernel to use to boot unknown test machines (default: `http://ci-gateway:9000/boot/default_kernel`)
* ``BOOTS_DEFAULT_INITRD``: Default initramfs to use to boot unknown test machines (default: `http://ci-gateway:9000/boot/default_boot2container.cpio.xz`)
* ``BOOTS_DEFAULT_CMDLINE``: Default kernel command line to use to boot unknown test machines (default: `b2c.container="-ti --tls-verify=false docker://ci-gateway:8002/gfx-ci/ci-tron/machine-registration:latest register" b2c.ntp_peer="ci-gateway" b2c.cache_device=none loglevel=6`)

However, since no single kernel/initramfs may be suitable for all the possible DUTs, the executor will look for the most suitable value by checking its environment variables in the following order:

1. ``BOOTS_DEFAULT_${BOOTLOADER}_${ARCH}_${PLATFORM}_[KERNEL|INITRD|CMDLINE]``
2. ``BOOTS_DEFAULT_${ARCH}_${PLATFORM}_[KERNEL|INITRD|CMDLINE]``
3. ``BOOTS_DEFAULT_${BOOTLOADER}_${ARCH}_[KERNEL|INITRD|CMDLINE]``
4. ``BOOTS_DEFAULT_${ARCH}_[KERNEL|INITRD|CMDLINE]``
5. ``BOOTS_DEFAULT_${BOOTLOADER}_[KERNEL|INITRD|CMDLINE]``
6. ``BOOTS_DEFAULT_[KERNEL|INITRD|CMDLINE]``

With the variables taking the following values:

* ``${BOOTLOADER}``: `IPXE`
* ``${ARCH}``: `I386`, `X86_64`, `ARM32`, `ARM64`
* ``${PLATFORM}``: `PCBIOS`, `EFI`

Example: The following options specify how to boot x86_64 (PCBIOS or EFI) and ARM64 (EFI-only) test machines. Please note how the same command line is used for all configurations, and how the ARM64 architecture only has a kernel specified for the EFI platform, while the same x86_64 kernel will be served for both the EFI and PCBIOS platforms.

* ``BOOTS_DEFAULT_X86_64_KERNEL``: `https://ci-gateway:9000/boot/default_x86_64_kernel`
* ``BOOTS_DEFAULT_X86_64_INITRD``: `https://ci-gateway:9000/boot/default_x86_64_initrd`
* ``BOOTS_DEFAULT_ARM64_EFI_KERNEL``: `https://ci-gateway:9000/boot/default_arm64_kernel.efi`
* ``BOOTS_DEFAULT_ARM64_INITRD``: `https://ci-gateway:9000/boot/default_arm64_initrd`
* ``BOOTS_DEFAULT_CMDLINE``: `b2c.container="-ti --tls-verify=false docker://ci-gateway:8002/gfx-ci/ci-tron/machine-registration:latest register" b2c.ntp_peer="ci-gateway" b2c.cache_device=none loglevel=6`

.. _executorctl:

Executor client - executorctl
=============================

The executor client can be found in git under `executor/client `_ and installed with ``pip``.

It can be used to queue a job on a DUT from the command line, when its state is ``IDLE``:

.. code-block:: bash

    $ executorctl run -t $machine_tag /path/to/job/file

Here is an extract of the help output of ``executorctl run``:

.. code-block:: bash

    usage: Executor client run [-h] [-w] [-c CALLBACK] [-t MACHINE_TAGS] [-i MACHINE_ID] [-s SHARE_DIRECTORY] [-j JOB_ID] [-a MINIO_AUTH] [-g MINIO_GROUP] job

    positional arguments:
      job                   Job that should be run

    options:
      -h, --help            show this help message and exit
      -w, --wait            Wait for a machine to become available if all are busy
      -c CALLBACK, --callback CALLBACK
                            Hostname that the executor will use to connect back to this client, useful for non-trivial routing to the test device
      -t MACHINE_TAGS, --machine-tag MACHINE_TAGS
                            Tag of the machine that should be running the job. Overrides the job's target.
      -i MACHINE_ID, --machine-id MACHINE_ID
                            ID of the machine that should run the job. Overrides the job's target.
      -s SHARE_DIRECTORY, --share-directory SHARE_DIRECTORY
                            Directory that will be forwarded to the job, and whose changes will be forwarded back to
      -j JOB_ID, --job-id JOB_ID
                            Identifier for the job, if you have one already.
      -a MINIO_AUTH, --minio-auth MINIO_AUTH
                            MinIO credentials that has access to all the groups specified using '-g'
      -g MINIO_GROUP, --minio-group MINIO_GROUP
                            Add the MinIO job user to the specified group. Requires valid credentials specified using '--minio-auth' which already have access this group

Examples of jobs that can be run under vivian can be found at `job_templates`_.

.. _job_templates: https://gitlab.freedesktop.org/gfx-ci/ci-tron/-/tree/master/executor/server/src/valve_gfx_ci/executor/server/job_templates

*TODO* Properly document the job description and file format

.. _executor_rest_api:

REST API
========

The executor includes a REST API with various endpoints available.

Endpoint ``/duts``
------------------

Method: GET

Lists the available machines and their information (IP address, tags, ...)

.. code-block:: bash

    curl localhost:8000/api/v1/duts

Endpoint ``/dut/``
------------------

Method: POST, PUT

Adds a new machine to ``MARS_DB_FILE``. If a discovery process is ongoing, its data is used to set the machine's PDU and port_id. This endpoint is used by the ``machine_registration.py`` script.

Endpoint ``/dut/<machine_id>``
------------------------------

Method: GET

Lists all the information about the selected machine. ``machine_id`` is the machine's MAC address.

.. code-block:: bash

    curl localhost:8000/api/v1/dut/<machine_id>
    curl localhost:8000/api/v1/dut/52:54:00:11:22:0a

Method: DELETE

Removes the machine from the database, along with all its associated GitLab runner tokens.

.. code-block:: bash

    curl -X DELETE localhost:8000/api/v1/dut/<machine_id>

.. _patch_dut:

Method: PATCH

Updates one or more of the DUT's editable fields:

* ``comment`` (str): Specify a comment about the DUT meant for the farm admins
* ``is_retired`` (bool): Tag the DUT as retired/active (see our :ref:`DUT state machine <dut_state_machine>`)
* ``pdu_off_delay`` (float): Number of seconds needed to ensure the machine is fully off
* ``ready_for_service`` (bool): Tag the DUT as ready for service (see our :ref:`DUT state machine <dut_state_machine>`)
* ``manual_tags`` (list[str]): Overwrite the manual tags

.. code-block:: bash

    curl -X PATCH localhost:8000/api/v1/dut/52:54:00:11:22:0a \
      -H 'Content-Type: application/json' \
      -d '{"pdu_off_delay": 10, "comment": "this is an example comment"}'

Endpoint ``/duts/<machine_id>/boot.ipxe``
-----------------------------------------

Method: GET

**TODO:** To be documented.

Endpoint ``/dut/<machine_id>/quick_check``
------------------------------------------

Method: GET

Returns ``true`` if a quick check of the machine has been queued, ``false`` otherwise.

.. code-block:: bash

    curl localhost:8000/api/v1/dut/<machine_id>/quick_check

Method: POST

Queues a quick check on the machine. No parameters are needed.

.. code-block:: bash

    curl -X POST localhost:8000/api/v1/dut/<machine_id>/quick_check

Endpoint ``/dut/discover``
--------------------------

Method: GET

Shows whether a discovery process is ongoing, along with its data: pdu, port_id and start date.

.. code-block:: bash

    curl localhost:8000/api/v1/dut/discover

Method: POST

Launches a discovery process: the executor boots the machine behind the given PDU/port_id and stores this data in ``discover_data``, to be used by the ``machine_registration.py`` script.

.. code-block:: bash

    curl -X POST localhost:8000/api/v1/dut/discover \
      -H 'Content-Type: application/json' \
      -d '{"pdu": "VPDU", "port_id": 10}'

If no machine shows up, the discovery process will automatically time out after 150 seconds by default. This value can be changed using the ``timeout`` parameter:

.. code-block:: bash

    curl -X POST localhost:8000/api/v1/dut/discover \
      -H 'Content-Type: application/json' \
      -d '{"pdu": "VPDU", "port_id": 10, "timeout": 60}'

Method: DELETE

Erases all the discovery data: ``discover_data`` will be emptied.

.. code-block:: bash

    curl -X DELETE localhost:8000/api/v1/dut/discover

Endpoint ``/dut/<machine_id>/cancel_job``
-----------------------------------------

Method: POST

Cancels the job running on a machine. ``machine_id`` is the machine's MAC address.

.. code-block:: bash

    curl -X POST localhost:8000/api/v1/dut/<machine_id>/cancel_job
    curl -X POST localhost:8000/api/v1/dut/52:54:00:11:22:0a/cancel_job

Endpoint ``/pdus``
------------------

Method: GET

Lists the available PDUs and their port_ids, with some information such as label or state.

.. code-block:: bash

    curl localhost:8000/api/v1/pdus

Endpoint ``/pdu/<pdu_name>``
----------------------------

Method: GET

Lists all the information about the selected PDU.

.. code-block:: bash

    curl localhost:8000/api/v1/pdu/<pdu_name>
    curl localhost:8000/api/v1/pdu/VPDU

Endpoint ``/pdu/<pdu_name>/port/<port_id>``
-------------------------------------------

Method: GET

Lists the information of a port_id: label, min_off_time and state.

.. code-block:: bash

    curl localhost:8000/api/v1/pdu/<pdu_name>/port/<port_id>
    curl localhost:8000/api/v1/pdu/VPDU/port/10

Method: PATCH

Turns a port OFF or ON.

.. code-block:: bash

    curl -X PATCH localhost:8000/api/v1/pdu/VPDU/port/10 \
      -H 'Content-Type: application/json' \
      -d '{"state": "on"}'

Reserves or un-reserves a port. Use ``true`` to reserve, ``false`` to un-reserve.

.. code-block:: bash

    curl -X PATCH localhost:8000/api/v1/pdu/VPDU/port/10 \
      -H 'Content-Type: application/json' \
      -d '{"reserved": true}'

Endpoint ``/full-state``
------------------------

Method: GET

Provides all the information from the endpoints ``/pdus``, ``/duts``, and ``/dut/discover`` in a single call.

Endpoint ``/jobs``
------------------

Method: POST

Used to submit jobs. To be documented.
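For scripted use, the ``curl`` calls of this section can also be prepared from Python. The following is a minimal sketch, not part of the executor: the helper names ``api_url`` and ``json_request`` are hypothetical, and the base URL is an assumption taken from the ``curl`` examples. The requests are only built, not sent. Serializing the payload with ``json.dumps`` avoids shell quoting issues and guarantees valid JSON literals (``true`` rather than ``True``).

```python
import json
import urllib.request

# Assumption: same host/port as the curl examples of this section.
BASE_URL = "http://localhost:8000/api/v1"


def api_url(*parts):
    """Join path components onto the API base URL."""
    return "/".join([BASE_URL, *(str(part) for part in parts)])


def json_request(method, url, payload=None):
    """Build (but do not send) a request with an optional JSON body."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Update some of the editable fields of a DUT
patch_dut = json_request(
    "PATCH",
    api_url("dut", "52:54:00:11:22:0a"),
    {"pdu_off_delay": 10, "comment": "this is an example comment"},
)

# Launch a discovery process with a custom timeout
discover = json_request(
    "POST",
    api_url("dut", "discover"),
    {"pdu": "VPDU", "port_id": 10, "timeout": 60},
)

# Reserve a PDU port: Python's True is serialized as JSON true
reserve = json_request("PATCH", api_url("pdu", "VPDU", "port", 10), {"reserved": True})

for req in (patch_dut, discover, reserve):
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Passing any of these objects to ``urllib.request.urlopen`` would send it, which requires a running executor reachable at the assumed base URL.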
MarsDB
======

MarsDB is the database for all the runtime data of the CI instance:

- List of PDUs connected
- List of test machines
- List of GitLab instances on which to expose the test machines

Its location is set using the ``MARS_DB_FILE`` environment variable, and it is live-editable. This means you can edit the file directly and changes will be reflected instantly in the executor.

Machines can be added to MarsDB by POSTing or PUTing to the ``/api/v1/dut/`` REST endpoint. Fields in the REST API match the ones found in the database, but some fields cannot be set at the creation of the machine for safety reasons: we want to enforce a separation between fields that are meant to be auto-generated and the ones that are meant to be manually configured (denoted by the ``(MANUAL)`` tag in the DB file description below). The most prominent manual fields are ``pdu`` and ``pdu_port_id``, which means that a newly-added machine won't be usable until it is associated to its PDU port by manually editing the DB file.

An easier way to enroll a new machine is to use the discovery process, by POSTing the ``pdu`` and ``port_id`` fields to the ``/api/v1/dut/discover`` endpoint. This will initiate the discovery sequence, where the executor will turn this port ON, wait for the machine to register itself, then automatically associate the machine to the PDU port specified in the discovery process. Using the discovery process allows a machine to go through the ``TRAINING`` process without further manual intervention.

Here is an annotated sample file, where ``AUTO`` means you should not modify this value (or any of its children), while ``MANUAL`` means that you are expected to set these values by editing the DB file manually, or through the ``REST`` interface. All the other values should be machine-generated, for example using the ``machine-registration`` container:

.. code-block:: yaml

    pdus:                               # List of all the power delivery units (MANUAL)
      APC:                              # Name of the PDU
        driver: apc_masterswitch        # The [driver of your PDU](pdu/README.md)
        config:                         # The configuration of the driver (driver-dependent)
          hostname: 10.0.0.2
      VPDU:                             # A virtual PDU, spawning virtual machines
        driver: vpdu
        config:
          hostname: localhost:9191
        reserved_port_ids: []           # List of reserved ports in the PDU where no virtual DUT can be added (REST)
    duts:                               # List of all the test machines
      de:ad:be:ef:ca:fe:                # MAC address of the machine
        base_name: gfx9                 # Most significant characteristic of the machine. Basis of the auto-generated name
        ip_address: 192.168.0.42        # IP address of the machine
        tags:                           # List of tags representing the machine
          - amdgpu:architecture:GCN5.1
          - amdgpu:family:RV
          - amdgpu:codename:RENOIR
          - amdgpu:gfxversion:gfx9
          - amdgpu:APU
          - amdgpu:pciid:0x1002:0x1636
        manual_tags:                    # List of tags that cannot be automatically generated (MANUAL)
          - freesync_display
        local_tty_device: ttyUSB0       # Test machine's serial port to talk to the gateway
        gitlab:                         # List of GitLab instances to expose this runner on
          freedesktop:                  # Parameters for the `freedesktop` GitLab instance
            token:                      # Token given by the registration process (AUTO)
            exposed: true               # Should this machine be exposed on `freedesktop`? (MANUAL)
            runner_id: 4242             # GitLab's runner ID associated to this machine
        pdu: APC                        # Name of the PDU to contact to turn ON/OFF this machine (MANUAL/REST)
        pdu_port_id: 1                  # ID of the port where the machine is connected (MANUAL/REST)
        pdu_off_delay: 30               # How long should the PDU port be off when rebooting the machine? (REST)
        ready_for_service: true         # The machine has been tested and can now be used by users (AUTO/REST)
        is_retired: false               # The user specified that the machine is no longer in use
        first_seen: 2021-12-22 16:57:08.146275  # When was the machine first seen in CI (AUTO)
        comment: null                   # Field used to add a quick note about a DUT for admins (MANUAL/REST)
    gitlab:                             # Configuration of anything related to exposing the machines on GitLab (MANUAL)
      freedesktop:                      # Name of the GitLab instance
        url: https://gitlab.freedesktop.org/  # URL of the instance
        registration_token:             # Registration token, as found in your GitLab project/group/instance settings
        access_token:                   # A read-only API token, used to verify consistency between the local and GitLab state
        expose_runners: true            # Expose the test machines on this instance? Handy for quickly disabling all machines
        maximum_timeout: 21600          # Maximum timeout allowed for any job running on our test machines
        gateway_runner:                 # Expose a runner that will run locally, and not on test machines
          token:                        # Token given by the registration process (AUTO)
          exposed: true                 # Should the gateway runner be exposed?
          runner_id: 4243               # GitLab's runner ID associated to this machine

Frequently asked questions
--------------------------

* How do I move runners from one GitLab project to another?

  There is currently no easy way of doing so. The best solution is to run the following command for every runner in MarsDB:

  .. code-block:: bash

    $ curl -X DELETE "https://gitlab.example.com/api/v4/runners" --form "token=<token>"

  The executor periodically checks the validity of the tokens and, upon seeing that they got deleted, will re-create them in the new project.
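The per-runner command from the FAQ entry above can be generated from the database contents. This is a rough sketch under stated assumptions: the function name ``runner_delete_commands`` is hypothetical, the input is a dictionary with the same layout as the annotated sample file (as returned by any YAML parser), the tokens and URL in ``sample`` are made up, and only the ``token`` fields shown in that sample file are considered. It prints the ``curl`` commands instead of running them, so they can be reviewed first.

```python
import shlex


def runner_delete_commands(mars_db):
    """Return one `curl -X DELETE` command per runner token found in the DB."""
    commands = []
    instances = mars_db.get("gitlab") or {}

    def add(instance_name, token):
        instance = instances.get(instance_name)
        if not token or not instance:
            return  # no token registered, or unknown GitLab instance
        url = instance["url"].rstrip("/") + "/api/v4/runners"
        commands.append(
            f"curl -X DELETE {shlex.quote(url)} --form {shlex.quote('token=' + token)}"
        )

    # Tokens of the test machines, one per GitLab instance they are registered on
    for dut in (mars_db.get("duts") or {}).values():
        for name, runner in (dut.get("gitlab") or {}).items():
            add(name, (runner or {}).get("token"))

    # Token of the gateway runner of each GitLab instance
    for name, instance in instances.items():
        add(name, (instance.get("gateway_runner") or {}).get("token"))

    return commands


# Made-up database contents, following the layout of the sample file
sample = {
    "duts": {
        "de:ad:be:ef:ca:fe": {
            "gitlab": {"freedesktop": {"token": "dut-token", "exposed": True}},
        },
    },
    "gitlab": {
        "freedesktop": {
            "url": "https://gitlab.example.com/",
            "gateway_runner": {"token": "gw-token", "exposed": True},
        },
    },
}

for command in runner_delete_commands(sample):
    print(command)
```

Entries without a token, such as a machine that has not been registered on an instance yet, are skipped.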