Recovery

Recovery procedure for RHAAP

When the configuration of an entire Ansible Automation Platform installation is lost, you can reinstall it and then click everything together by hand, but that will take weeks and there is no guarantee that the configuration is the same as it was when the platform failed.
In very few cases is everything so well documented that you can click it all back together in a UI and recreate the situation from before the crash. "Configuration As Code" was devised for exactly this problem.
After the reinstallation, we can run all the configuration as code again. In a large environment even that is time-consuming, because the configuration is housed in separate repositories that would each have to be run by hand. To prevent this, we have developed this recovery procedure, which reduces manual work to a minimum.
A requirement for correct operation is that every organization known in RHAAP has a repository in GitLab with the correct name and content. A correct version must also be merged into the branch of the target environment that needs to be restored, otherwise the recovery of that organization will fail.

Steps to a full recovery

A complete recovery of an environment consists of the following steps, which should preferably be carried out automatically:
1. Restoring a namespace on Openshift (or a VM)
2. Installing and running the operator for RHAAP (installation playbook)
3. Restoring the configuration of rhaap base using config as code
4. Restoring any custom collections
5. Restoring all custom execution environments
6. Running the config as code for all teams

The automation of recovery

Of course, it would be useful to be able to start this recovery from a different AAP environment. That way, the loss of one environment can be recovered from another environment that is still running. We have housed all configuration as code in Git, with pipelines that configure AAP without the intervention of other systems, so we can start the recovery by triggering these pipelines in the right order. Since Git is the executor and not AAP, there is no need to arrange access between AAP environments.

The big picture

To restore the environment, we have already seen which steps must be taken to achieve full recovery. We are not going to cover the recovery of the RHAAP installation here. In this repository we pick up the recovery from the moment AAP is reinstalled and ready to be configured.

This involves the following (automated) steps:
1. Automation Hub Recovery
2. RHAAP proof automated Configuration Recovery

If you want to trigger a pipeline from outside via the GitLab API, you need to create a separate GitLab token for each pipeline and keep it in the play from which you want to run the trigger. That amounts to a lot of administration, and the tokens also expire. We certainly don't want this: once our code has been started, it must continue until everything has been executed. That won't be the case if new tokens have to be requested all the time, and then there is no longer any real automatic recovery.
To prevent this, GitLab lets projects trigger each other as if one were a dependency of the other, with no registered token needed. A project gets such a token implicitly. This functionality is intended for dependencies, but we are going to abuse it for the recovery process, because it saves a lot of administration.
What are we going to do?
1. Create a new GitLab project for this part of the recovery
2. Write a pipeline with a dependency on the GitLab project (or projects) we want to execute
3. Push it to GitLab
4. Wait for the pipeline to finish
5. Delete the temporary project

We do this for every recovery step.
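
The dependency mechanism described above can be sketched as a minimal pipeline for the temporary project. This is an illustration only: the downstream project cac_25/rhaap_base is taken from the example repositories list used later on, while the job name, the stage name and the dev branch are assumptions.

```yaml
# .gitlab-ci.yml of the temporary recovery project (illustrative sketch)
stages:
  - base_config

# Job name is hypothetical
trigger_base_config:
  stage: base_config
  trigger:
    project: cac_25/rhaap_base   # downstream project whose pipeline should run
    branch: dev                  # assumed environment branch
    strategy: depend             # wait for the downstream pipeline and mirror its status
```

With strategy: depend the triggering job only finishes when the downstream pipeline has finished, which is what lets the recovery wait for each step.
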
For the recovery steps, we create job template(s) in RHAAP with a survey for each environment. We only do this for the MGT (management) organization, so that only system administrators have access.
We have brought the recovery steps together in a repository, where the steps are recorded in separate playbooks. We are going to explain these here.

Generic data

Both playbooks need some configuration data that defines the environment they need to function in. Since it's almost the same for both, we've merged it into 1 file: env_vars.yml

---
# put your vars in here and make sure this file is ALWAYS vault encrypted
# the values in this file will be encrypted and used in the config files.
gitlab_protocol: 'https://'
# ensure the gitlab_url has the final "/"
gitlab_url: 'gitlab.homelab/'
rhaap_project_name: run_gateway_recovery
gitlab_group: cac_25
gitlab_user: <user>
gitlab_password: <passwd>
cac_base_config: rhaap_base
org_project_name_prefix: rhaap_cac_

# List repositories to run the pipeline for, including the gitlab group
# name. This list is populated with:
# - the base configuration for rhaap gateway and hub
# - ee environments
# - collections
# to be loaded into the hub.
repositories:
  - cac_25/rhaap_base
  - images/ee_cac_image
  - collections/example.coll_1
  - collections/example.coll_2

  # Examples for additional repositories:
  # - images/ee_cac_image
  # - collections/shs.infra
  # - collections/shs.rhel

controllers:
  dev:
    host: <dev_env_fqdn>
    admin_pw: <password>

  prod:
    host: <prod_env_fqdn>
    admin_pw: <some_passwd>

Above is the data that makes the operation within the environment possible. First of all, the GitLab environment is defined. The user data is important: a user with sufficient rights to perform these actions must be created in GitLab. Creating and using such a service user prevents someone from having to use their personal credentials, which would be even less desirable. To keep this secure, the data must be encrypted.
Secondly, the file defines how GitLab is set up for config as code. The configuration as code repositories should all be housed in this GitLab group, which keeps the code simple and avoids extra administration.
The repositories variable is an exception to the rule, because it is not only about config as code but also about the organization's 'own' content. Self-built collections and execution environments must be available in the automation hub before you can fully restore the controller, and this is where we list them. This is the only piece of administration that we have to keep up to date.
The controllers variable is used to read in the correct configuration, after the base config has been restored, so that only the configured organizations will be restored.
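A pre-flight check can catch mistakes in these variables before anything is created in GitLab. The task below is a sketch and not part of the original playbooks. It assumes that gitlab_env_branch is supplied at launch time (for example through the survey) and that its value matches a key under controllers.

```yaml
# Hypothetical pre-task: validate env_vars.yml before the recovery starts
- name: Validate the recovery variables
  ansible.builtin.assert:
    that:
      - gitlab_url.endswith('/')           # env_vars.yml asks for the final "/"
      - repositories | length > 0          # at least one project to trigger
      - gitlab_env_branch is defined       # assumed to come from the survey
      - gitlab_env_branch in controllers   # e.g. dev or prod
    fail_msg: "env_vars.yml is incomplete or inconsistent for this environment"
```
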
hosts.yaml:

all:
  hosts:
    localhost

When we talk about an 'empty' inventory, this is its most basic form. Since Ansible requires an inventory, we give it one. This is where the generic part of this repository ends and we move on to the recovery parts.

Recovery code rhaap_base config

Below is the playbook that is used to fully automate the recovery of the gateway and the automation hub.
recover_gateway.yml:

This playbook doesn't do the actual work itself. It creates a GitLab project with a pipeline that triggers the pipelines of the configured projects, which then run their configuration as code.
This simply reuses the configuration that was in place the last time the environment was configured, so it should always work.

recover_rhaap.yml

---
- name: Recover automation platform configuration
  hosts: localhost
  connection: local
  gather_facts: false

  pre_tasks:
    - name: Get vars
      ansible.builtin.include_vars: "env_vars.yml"

    - name: "Create GitLab Project in group {{ gitlab_group }}"
      community.general.gitlab_project:
        api_url: "{{ gitlab_protocol }}{{ gitlab_url }}"
        validate_certs: false
        api_username: "{{ gitlab_user }}"
        api_password: "{{ gitlab_password }}"
        name: "{{ rhaap_project_name }}"
        group: "{{ gitlab_group }}"
        default_branch: "{{ gitlab_env_branch }}"
        shared_runners_enabled: true
        initialize_with_readme: true
        state: present

    - name: Clone the new gitlab repository
      ansible.builtin.git:
        repo: "{{ gitlab_protocol }}{{ gitlab_user }}:{{ gitlab_password }}@{{ gitlab_url }}{{ gitlab_group }}/{{ rhaap_project_name }}.git"
        dest: "/tmp/{{ rhaap_project_name }}"
        version: "{{ gitlab_env_branch }}"
        clone: true
        update: true
      environment:
        GIT_SSL_NO_VERIFY: 'true'

  tasks:

    - name: Template the gitlab-ci.yml for all organizations
      ansible.builtin.template:
        src: gitlab-ci-rhaap.yml.j2
        dest: "/tmp/{{ rhaap_project_name }}/.gitlab-ci.yml"
        mode: '0644'

    - name: Push the updated GitLab repository     # noqa: command-instead-of-module
      ansible.builtin.shell: |
        git config --global user.name "{{ gitlab_user }}"
        git config --global user.email "{{ gitlab_user }}@example.com"
        git add --all
        git commit -m 'initial config'
        git -c http.sslVerify=false push origin "{{ gitlab_env_branch }}"
      args:
        chdir: "/tmp/{{ rhaap_project_name }}"
      changed_when: false

    - name: Delete the temporary directory
      ansible.builtin.file:
        path: /tmp/{{ rhaap_project_name }}
        state: absent

    - name: Sleep for 60 sec
      ansible.builtin.pause:
        seconds: 60

    - name: GitLab Post | Obtain Access Token
      ansible.builtin.uri:
        url: "{{ gitlab_protocol }}{{ gitlab_url }}oauth/token"
        method: POST
        validate_certs: false
        body_format: json
        headers:
          Content-Type: application/json
        body: >
          {
            "grant_type": "password",
            "username": "{{ gitlab_user }}",
            "password": "{{ gitlab_password }}"
          }
      register: gitlab_access_token

    - name: Store the token in var
      ansible.builtin.set_fact:
        token: "{{ gitlab_access_token.json.access_token }}"

    - name: Check the pipeline until it has run
      ansible.builtin.uri:
        url: "{{ gitlab_protocol }}{{ gitlab_url }}api/v4/projects/{{ gitlab_group }}%2F{{ rhaap_project_name }}/pipelines"
        validate_certs: false
        headers:
          Authorization: "Bearer {{ token }}"
      register: _jobs_list
      until: _jobs_list.json[0].status == "success"
      retries: 120
      delay: 20

    - name: "Delete GitLab Project from group {{ gitlab_group }}"
      community.general.gitlab_project:
        api_url: "{{ gitlab_protocol }}{{ gitlab_url }}"
        validate_certs: false
        api_username: "{{ gitlab_user }}"
        api_password: "{{ gitlab_password }}"
        name: "{{ rhaap_project_name }}"
        group: "{{ gitlab_group }}"
        default_branch: "{{ gitlab_env_branch }}"
        shared_runners_enabled: true
        initialize_with_readme: true
        state: absent

    # Reload all organizations
    - name: "Create GitLab Project in group {{ gitlab_group }}"
      community.general.gitlab_project:
        api_url: "{{ gitlab_protocol }}{{ gitlab_url }}"
        validate_certs: false
        api_username: "{{ gitlab_user }}"
        api_password: "{{ gitlab_password }}"
        name: "{{ rhaap_project_name }}"
        group: "{{ gitlab_group }}"
        default_branch: "{{ gitlab_env_branch }}"
        shared_runners_enabled: true
        initialize_with_readme: true
        state: present

    - name: Clone the new gitlab repository
      ansible.builtin.git:
        repo: "{{ gitlab_protocol }}{{ gitlab_user }}:{{ gitlab_password }}@{{ gitlab_url }}{{ gitlab_group }}/{{ rhaap_project_name }}.git"
        dest: "/tmp/{{ rhaap_project_name }}"
        version: "{{ gitlab_env_branch }}"
        clone: true
        update: true
      environment:
        GIT_SSL_NO_VERIFY: 'true'

    - name: Controller | Read the organizations list
      ansible.builtin.uri:
        url: "https://{{ controllers[gitlab_env_branch]['host'] }}/api/v2/organizations/"
        user: admin
        password: "{{ controllers[gitlab_env_branch]['admin_pw'] }}"
        method: GET
        body_format: json
        force_basic_auth: true
        validate_certs: false
      register: _controller_organizations

    - name: Get the list of dicts
      ansible.builtin.set_fact:
        organizations: "{{ _controller_organizations.json.results }}"

    - name: Template the gitlab-ci.yml for the controller configuration
      ansible.builtin.template:
        src: gitlab-ci-organizations.yml.j2
        dest: "/tmp/{{ rhaap_project_name }}/.gitlab-ci.yml"
        mode: '0644'

    - name: Push the updated GitLab repository     # noqa: command-instead-of-module
      ansible.builtin.shell: |
        git config --global user.name "{{ gitlab_user }}"
        git config --global user.email "{{ gitlab_user }}@example.com"
        git add --all
        git commit -m 'initial config'
        git -c http.sslVerify=false push origin "{{ gitlab_env_branch }}"
      args:
        chdir: "/tmp/{{ rhaap_project_name }}"
      changed_when: false

    - name: Delete the temporary directory
      ansible.builtin.file:
        path: /tmp/{{ rhaap_project_name }}
        state: absent

    - name: Sleep for 60 sec
      ansible.builtin.pause:
        seconds: 60

    - name: GitLab Post | Obtain Access Token
      ansible.builtin.uri:
        url: "{{ gitlab_protocol }}{{ gitlab_url }}oauth/token"
        method: POST
        validate_certs: false
        body_format: json
        headers:
          Content-Type: application/json
        body: >
          {
            "grant_type": "password",
            "username": "{{ gitlab_user }}",
            "password": "{{ gitlab_password }}"
          }
      register: gitlab_access_token

    - name: Store the token in var
      ansible.builtin.set_fact:
        token: "{{ gitlab_access_token.json.access_token }}"

    - name: Check the pipeline until it has run
      ansible.builtin.uri:
        url: "{{ gitlab_protocol }}{{ gitlab_url }}api/v4/projects/{{ gitlab_group }}%2F{{ rhaap_project_name }}/pipelines"
        validate_certs: false
        headers:
          Authorization: "Bearer {{ token }}"
      register: _jobs_list
      until: _jobs_list.json[0].status == "success"
      retries: 120
      delay: 20

    - name: "Delete GitLab Project in group {{ gitlab_group }}"
      community.general.gitlab_project:
        api_url: "{{ gitlab_protocol }}{{ gitlab_url }}"
        validate_certs: false
        api_username: "{{ gitlab_user }}"
        api_password: "{{ gitlab_password }}"
        name: "{{ rhaap_project_name }}"
        group: "{{ gitlab_group }}"
        default_branch: "{{ gitlab_env_branch }}"
        shared_runners_enabled: true
        initialize_with_readme: true
        state: absent

Finally, we make sure that the recovery project in GitLab is deleted, which also signals that the playbook has finished cleanly. If the project still exists, an error has occurred somewhere and the pipeline logs of this project show which part did not complete successfully. The only other thing we need for execution is a correct template to write the pipeline into the repository.
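
One limitation of the polling task as written is that it only stops early on success, so a failed pipeline keeps it retrying until the retries run out. A possible variant, shown as a sketch rather than the playbook's actual code, ends the loop on any finished state and then fails explicitly. The status names are the ones the GitLab pipelines API reports. The guard on an empty list is an added assumption for the case where the pipeline has not been created yet.

```yaml
- name: Check the pipeline until it has finished
  ansible.builtin.uri:
    url: "{{ gitlab_protocol }}{{ gitlab_url }}api/v4/projects/{{ gitlab_group }}%2F{{ rhaap_project_name }}/pipelines"
    validate_certs: false
    headers:
      Authorization: "Bearer {{ token }}"
  register: _jobs_list
  until:
    - _jobs_list.json | length > 0     # a pipeline exists
    - _jobs_list.json[0].status in ['success', 'failed', 'canceled', 'skipped']
  retries: 120
  delay: 20

- name: Stop the recovery when the pipeline did not succeed
  ansible.builtin.assert:
    that:
      - _jobs_list.json[0].status == 'success'
    fail_msg: "Recovery pipeline ended with status {{ _jobs_list.json[0].status }}"
```

Because the assert fails before the delete task runs, the temporary project stays in place for inspection, which matches the behaviour described above.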

templates/gitlab-ci-rhaap.yml.j2:

This is the template that is used in the previous playbook to template out the pipeline into the new recovery project.
This generates a pipeline to trigger all related projects.
Update the tag and image to match your runner configuration:

# Pull the ansible config as code image
# change this to suit your installation, for a runner on openshift or docker you will need an image
image: docker.homelab:5000/cac-image:1.3

# List of pipeline stages
stages:
{% for repo in repositories %}
  - {{ repo | lower }}
  - sleep_{{ repo | lower }}
{% endfor %}

{% for repo in repositories %}
{{ repo | lower }}:
  tags:
    - shared
  stage: {{ repo |lower }}
  trigger:
    project: {{ repo | lower }}
    branch: {{ gitlab_env_branch }}
    strategy: depend
  when: always

sleep_{{ repo | lower }}:
  tags:
    - shared
  stage: sleep_{{ repo | lower }}
  script:
    - sleep 20
  when: always

{% endfor %}

The template shown above ensures that the projects listed in the repositories variable are called in order, which starts and runs each project's pipeline for the given branch. There is a 20-second pause between each call to such a dependent project.
This pause is not always necessary and is probably environment dependent. In the environment where this was built and tested, a pause was needed between some of the projects. Since the template cannot single out 'some', it pauses after all of them.
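
To make the result concrete, this is roughly what the template renders to for the start of the example repositories list, assuming gitlab_env_branch is set to dev. Only the stages and the first trigger job are shown. The sleep jobs and the remaining repositories follow the same pattern.

```yaml
image: docker.homelab:5000/cac-image:1.3

stages:
  - cac_25/rhaap_base
  - sleep_cac_25/rhaap_base
  - images/ee_cac_image
  - sleep_images/ee_cac_image
  # ...one pair of stages per remaining repository

cac_25/rhaap_base:
  tags:
    - shared
  stage: cac_25/rhaap_base
  trigger:
    project: cac_25/rhaap_base
    branch: dev          # value of gitlab_env_branch (assumed here)
    strategy: depend
  when: always
```

Since every job has its own stage, the order of the repositories list is also the order in which the projects are triggered.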

templates/gitlab-ci-organizations.yml.j2

# Pull the ansible config as code image
# If the gitlab runner is on Openshift or Docker configure the correct pipeline image here
image: docker.homelab:5000/cac-image:1.3

# List of pipeline stages
stages:
  - controller_base_config
  - sleep_controller_base_config
{% for org in organizations %}
{% if org.name | lower != 'ssc-campus' %}
  - {{ org.name | lower }}
  - sleep_{{ org.name | lower }}
{% endif %}
{% endfor -%}

controller_base_config:
  tags:
    - shared
  stage: controller_base_config
  trigger:
    project: {{ gitlab_group }}/{{ controller_base_config }}
    branch: {{ gitlab_env_branch }}
    strategy: depend
  when: always

sleep_controller_base_config:
  tags:
    - shared
  stage: sleep_controller_base_config
  script:
    - sleep 20
  when: always

{% for org in organizations %}
{% if org.name | lower != 'ssc-campus' %}
{{ org.name | lower }}:
  tags:
    - shared
  stage: {{ org.name | lower }}
  trigger:
    project: {{ gitlab_group }}/{{ org_project_name_prefix }}{{ org.name.split('_')[1] | lower }}
    branch: {{ gitlab_env_branch }}
    strategy: depend
  when: always

sleep_{{ org.name | lower }}:
  tags:
    - shared
  stage: sleep_{{ org.name |lower }}
  script:
    - sleep 20
  when: always

{% endif %}
{% endfor %}
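
The project name in the trigger is derived from the organization name, so it can help to preview that mapping before the pipeline is pushed. The debug task below is a sketch that reuses the expression from the template. It is not part of the original playbook. With a hypothetical organization named ORG_Infra and the example variables, it would report cac_25/rhaap_cac_infra.

```yaml
# Hypothetical helper task: preview the organization-to-project mapping
- name: Show which GitLab project each organization maps to
  ansible.builtin.debug:
    msg: "{{ item.name }} -> {{ gitlab_group }}/{{ org_project_name_prefix }}{{ item.name.split('_')[1] | lower }}"
  loop: "{{ organizations }}"
  loop_control:
    label: "{{ item.name }}"                  # keep the loop output short
  when: item.name | lower != 'ssc-campus'     # same exclusion as the template
```

The expression takes the part after the first underscore, so it assumes organization names of the form PREFIX_name. A name without an underscore has no second element and cannot be resolved.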

Obtain API Token

Not all environments allow you to specify in advance the token that the automation hub should use, so a manual step has to be inserted. If you can set it during the installation, keep it the same as the token you specified on the GitLab group for the config as code. Then this step is no longer necessary and the recovery can continue fully automatically.
The big manual step:
Log in to the newly configured automation hub, generate a new token and commit it to the GitLab group for the Configuration As Code repositories. As a result, the new token will be included in all pipelines that are yet to be executed.
Then we can continue with the playbook below to restore the controller configuration.
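
If the token is kept as a group-level CI/CD variable, which is an assumption about how the repositories consume it, storing the new token could be scripted once it has been generated. The sketch below uses the community.general.gitlab_group_variable module. The variable name AH_TOKEN and the new_hub_token value are hypothetical and must be replaced with whatever your pipelines actually read.

```yaml
# Sketch only: AH_TOKEN and new_hub_token are hypothetical names
- name: Store the new automation hub token on the GitLab group
  community.general.gitlab_group_variable:
    api_url: "{{ gitlab_protocol }}{{ gitlab_url }}"
    validate_certs: false
    api_username: "{{ gitlab_user }}"
    api_password: "{{ gitlab_password }}"
    group: "{{ gitlab_group }}"
    purge: false                 # leave the other group variables untouched
    variables:
      - name: AH_TOKEN
        value: "{{ new_hub_token }}"
        masked: true             # hide the value in job logs
```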
