Recovery
Recovery procedure for AAP
When the configuration of an entire Ansible Automation Platform installation is lost,
you can reinstall it and then piece everything back together by hand, but that takes
weeks and the result will never exactly match the original configuration. Only in rare
cases is everything documented so well that you can rebuild the same situation as
before the crash through the UI. This is exactly the problem that "Configuration as
Code" was devised and created for.
After the reinstallation we can simply run all the configuration as code again. But even
that is quite a time-consuming job in a large environment: because all configuration is
housed in separate repositories, doing all of this by hand becomes a lengthy task. To
prevent this, we have developed this recovery procedure, which reduces the manual
work to a minimum.
A requirement for correct operation is that every organization that is known in
RHAAP has a repository in gitlab with the correct name and content. In addition, a
correct version must be merged to the branch of the target environment that needs to
be restored, otherwise the recovery of that organization will fail.
Steps to a full recovery
A complete recovery of an environment consists of the following steps, which should
preferably be carried out automatically:
1. Restoring a namespace on Openshift (or a VM)
2. Installing and running the operator for AAP (installation playbook)
3. Restoring the configuration of automation hub using config as code
4. Restoring any custom collections
5. Restoring all custom execution environments
6. Obtaining the token (manually in automation hub)
7. Running the config as code of AAP
8. Running the config as code for all teams
The automation of recovery
Of course, it would be useful to be able to start this recovery from a different AAP environment. That way, the loss of one environment can be recovered from another environment that is still running. Since all configuration as code is housed in git, with pipelines that apply it to AAP without the intervention of other systems, we can start the recovery simply by triggering these pipelines in the right order. Because git is the executor and not AAP, there is no need to arrange access between AAP environments.
The big picture
We have already seen which steps are needed to achieve a full recovery of the environment. We are not going to cover the recovery of the RHAAP installation itself here; in this repository we pick up the recovery from the moment AAP is reinstalled and ready to be configured.
This involves the following (automated) steps:
1. Automation Hub Recovery
2. Obtaining the API token
3. Automated recovery of the RHAAP configuration
If you want to trigger a pipeline from outside via the gitlab API, you need to create a
separate gitlab token for each pipeline and store it in the play from which you want to
run this trigger. That amounts to a lot of administration, and the tokens will also
expire. We certainly don't want this: we want to be able to assume that once our code
has been started, it will keep running until everything has been executed. That won't
be the case if new tokens continuously have to be requested; then there is no longer
any truly automatic recovery.
To prevent this, gitlab makes it possible to have projects trigger each other, as if one
were a dependency of the other. No registered token is needed, because a project gets
such a token implicitly. This functionality is intended for dependencies, but we are
going to (ab)use it for the recovery process, because it saves a lot of administration.
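As a minimal sketch (the project path and branch are examples), such a cross-project trigger job in a .gitlab-ci.yml looks like this:

stages:
  - trigger_downstream

trigger_downstream:
  stage: trigger_downstream
  trigger:
    project: cac/cac-ahub-config   # downstream config as code project (example)
    branch: test                   # branch of the environment to restore (example)
    strategy: depend               # this pipeline waits for the downstream result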
What are we going to do?
1. We create a new gitlab project for this recovery step
2. We write a pipeline with a dependency on the gitlab project (or projects) we want to execute
3. We push it to gitlab
4. We wait for the pipeline to be executed
5. We delete this temporary project
We do this for every recovery step. For the recovery steps, we create job template(s) in RHAAP with a survey for each environment. We only do this in the MGT (management) organization, so that only system administrators have access. We have brought the recovery steps together in one repository, where the steps are recorded in separate playbooks. We explain them here.
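As an illustration, such a job template could itself be described in configuration as code. The sketch below assumes the variable format of the controller_configuration collections; the organization, project and inventory names are examples:

controller_templates:
  - name: "Recover Automation Hub"        # example job template name
    organization: "MGT"                   # management organization only
    project: "aap-recovery"               # hypothetical project containing the recovery playbooks
    inventory: "localhost"                # the 'empty' inventory described below
    playbook: "recover_ahub.yml"
    survey_enabled: true
    survey_spec:
      name: ""
      description: ""
      spec:
        - question_name: "Environment to restore"
          variable: "gitlab_env_branch"
          type: "multiplechoice"
          choices:
            - dev
            - test
            - accp
            - prod
          required: true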
Generic data
Both playbooks need some configuration data that defines the environment they need to function in. Since it's almost the same for both, we've merged it into 1 file: env_vars.yml
---
# put your vars in here and make sure this file is ALWAYS vault encrypted
# the values in this file will be encrypted and used in the config files.
gitlab_protocol: 'https://'
# ensure the gitlab_url has the final "/"
gitlab_url: 'gitlab.example.com/'
gitlab_user: <username_gitlab_svc_account>
gitlab_password: <passwd_gitlab_svc_account>
gitlab_group: 'CaC'
aap_base_config: aap-base-config
org_project_name_prefix: aap-config-
ahub_project_name: run_ahub_recovery
aap_project_name: run_aap_recovery
# List repositories to run the pipeline for, including the gitlab group
# name.
repositories:
  - CaC/cac-ahub-config
  # Examples for additional repositories:
  #- images/ee_cac_image
  #- collections/linux.infra
  #- collections/linux.rhel
controllers:
  dev:
    host: controller.dev.example.com
    passwd: <admin password>
  test:
    host: controller.test.example.com
    passwd: <admin password>
  accp:
    host: controller.accp.example.com
    passwd: <admin password>
  prod:
    host: controller.prod.example.com
    passwd: <admin password>
Above is the data that makes operation within the environment possible. First of all, the gitlab environment is defined. The user data is important: a user must be created in gitlab who has sufficient rights to perform all of this. Creating and using such a service account prevents someone from having to use their personal credentials, which would of course be even more undesirable. To keep this secure, this data must be encrypted.

Secondly, the setup of gitlab for config as code is defined. The configuration as code repositories should be housed in this gitlab group. This also keeps the code simple, without too much administration. The repositories variable is an exception to that rule: it covers not only the config as code, but also the organization's 'own' content. Self-built collections and execution environments must also be available in the automation hub before you can fully restore the controller, so we keep track of all of them here. This is the only piece of administration that we will have to maintain. The controllers variable is used to read in the correct configuration after the base config has been restored, so that only the configured organizations will be restored.
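As mentioned, env_vars.yml must always be vault encrypted; this can be done with the standard ansible-vault tooling, for example:

# encrypt the variable file in place (prompts for a vault password)
ansible-vault encrypt env_vars.yml

# later changes can be made without decrypting the file on disk
ansible-vault edit env_vars.yml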
hosts.yaml:
all:
  hosts:
    localhost:
When we talk about an 'empty' inventory, this is its most basic form. Since ansible requires an inventory, we give it one. This is where the generic part of this repository ends, and we move on to the recovery parts.
Recovery code automation hub
Below is the playbook that is used to fully automate the recovery of the automation hub. For the explanation we chop this playbook into functional pieces; if you put the pieces back together, you have the full playbook. recover_ahub.yml:
---
- hosts: localhost
  gather_facts: false
  pre_tasks:
    - name: Get vars
      ansible.builtin.include_vars: env_vars.yml
    - name: "Create GitLab Project in group {{ gitlab_group }}"
      community.general.gitlab_project:
        api_url: "{{ gitlab_protocol }}{{ gitlab_url }}"
        validate_certs: false
        api_username: "{{ gitlab_user }}"
        api_password: "{{ gitlab_password }}"
        name: "{{ ahub_project_name }}"
        group: "{{ gitlab_group }}"
        default_branch: "{{ gitlab_env_branch }}"
        shared_runners_enabled: true
        initialize_with_readme: true
        state: present
    - name: Clone the new gitlab repository
      ansible.builtin.git:
        repo: "{{ gitlab_protocol }}{{ gitlab_user }}:{{ gitlab_password }}@{{ gitlab_url }}{{ gitlab_group }}/{{ ahub_project_name }}.git"
        dest: "/tmp/{{ ahub_project_name }}"
        version: "{{ gitlab_env_branch }}"
        clone: true
        update: true
      environment:
        GIT_SSL_NO_VERIFY: true
In the above part of the playbook, we create a git repository and clone it to a temporary directory. Almost all of the variables used here come from env_vars.yml; the only variable that needs to be passed to the playbook is the environment that needs to be restored, in the variable 'gitlab_env_branch'.
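When the playbook is run directly instead of from the job template in RHAAP, a possible invocation could look like this (the vault password is needed because env_vars.yml is encrypted):

# hypothetical direct invocation; normally this runs from a job template with a survey
ansible-playbook -i hosts.yaml recover_ahub.yml \
  -e gitlab_env_branch=test \
  --ask-vault-pass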
  tasks:
    - name: Template the gitlab-ci.yml for all repositories to recover
      ansible.builtin.template:
        src: gitlab-ci-ahub.yml.j2
        dest: "/tmp/{{ ahub_project_name }}/.gitlab-ci.yml"
        mode: '0644'
    - name: Push the updated GitLab repository
      ansible.builtin.shell: |
        git config --global user.name "{{ gitlab_user }}"
        git config --global user.email "{{ gitlab_user }}@example.com"
        git add --all
        git commit -m 'initial config'
        git -c http.sslVerify=false push origin "{{ gitlab_env_branch }}"
      args:
        chdir: "/tmp/{{ ahub_project_name }}"
      changed_when: false
    - name: Delete the temporary directory
      ansible.builtin.file:
        path: "/tmp/{{ ahub_project_name }}"
        state: absent
    - name: Sleep for 10 sec
      ansible.builtin.pause:
        seconds: 10
Once the repository is cloned locally, files can be written into it. We do that here too, but in this case only one file: the pipeline that will be started when this repository is pushed back to git. Because we use a template that defines our pipeline, we can trigger other repositories from this pipeline, as if they were dependencies of this repository, which causes their pipelines to run for the specified branch. Let that be the configuration as code for the automation hub that we want to restore... All repositories specified in env_vars.yml will be triggered, because the template reads them in a loop.

After pushing the modified file(s) in the local copy of the repository, the pipeline in git will be triggered by the update. We no longer need the local directory, so we delete it. Finally, we wait 10 seconds to give gitlab a chance to create and launch the pipeline.
    - name: GitLab Post | Obtain Access Token
      ansible.builtin.uri:
        url: "{{ gitlab_protocol }}{{ gitlab_url }}oauth/token"
        method: POST
        validate_certs: false
        body_format: json
        headers:
          Content-Type: application/json
        body: >
          {
            "grant_type": "password",
            "username": "{{ gitlab_user }}",
            "password": "{{ gitlab_password }}"
          }
      register: gitlab_access_token
    - name: Store the token in var
      ansible.builtin.set_fact:
        token: "{{ gitlab_access_token.json.access_token }}"
    - name: Check the pipeline until it has run
      ansible.builtin.uri:
        url: "{{ gitlab_protocol }}{{ gitlab_url }}api/v4/projects/{{ gitlab_group }}%2F{{ ahub_project_name }}/pipelines"
        validate_certs: false
        headers:
          Authorization: "Bearer {{ token }}"
      register: _jobs_list
      until: _jobs_list.json[0].status == "success"
      retries: 120
      delay: 20
To make sure that everything goes according to plan, and because we want to see the status in the controller, we follow the pipeline so that its status can also be reported back to the controller.
- name: "Delete GitLab Project in group {{ gitlab_group }}"
community.general.gitlab_project:
api_url: "{{ gitlab_protocol }}{{ gitlab_url }}"
validate_certs: true
api_username: "{{ gitlab_user }}"
api_password: "{{ gitlab_password }}"
name: "{{ ahub_project_name }}"
group: "{{ gitlab_group }}"
default_branch: "{{ gitlab_env_branch }}"
shared_runners_enabled: true
initialize_with_readme: true
state: absent
Finally, we make sure that the recovery project in gitlab is neatly deleted, which also confirms that the playbook finished cleanly. If the project still exists, an error has occurred somewhere, and the pipeline logs of this project can be used to see which part did not come to a successful conclusion. The only thing we still need for execution is a correct template for the pipeline that is written into the repository.
gitlab-ci-ahub.yml.j2:
# Pull the ansible config as code image
# change this to suit your installation, for a runner on openshift or docker
# you will need an image
# image: localhost:5000/ansible-image:1.0

# List of pipeline stages
stages:
{% for repo in repositories %}
  - {{ repo | lower }}
  - sleep_{{ repo | lower }}
{% endfor %}

{% for repo in repositories %}
{{ repo | lower }}:
  stage: {{ repo | lower }}
  trigger:
    project: {{ repo | lower }}
    branch: {{ gitlab_env_branch }}
    strategy: depend
  when: always

sleep_{{ repo | lower }}:
  stage: sleep_{{ repo | lower }}
  script:
    - sleep 20
  when: always
{% endfor %}
The template shown above ensures that the projects specified in the repositories variable are called in order, which initiates the pipeline of each of those projects for the given branch. There is a 20-second pause after each call to such a dependent project. This is not always necessary and is probably environment dependent; in the environment where this was built and tested, a pause between some of the projects was indeed needed. Since we can't easily determine 'some', we pause after all of them.
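To make this concrete: with the example env_vars.yml above (only CaC/cac-ahub-config in repositories) and gitlab_env_branch set to test, the template would render to roughly the following .gitlab-ci.yml (note that the project paths are lowercased):

stages:
  - cac/cac-ahub-config
  - sleep_cac/cac-ahub-config

cac/cac-ahub-config:
  stage: cac/cac-ahub-config
  trigger:
    project: cac/cac-ahub-config
    branch: test
    strategy: depend
  when: always

sleep_cac/cac-ahub-config:
  stage: sleep_cac/cac-ahub-config
  script:
    - sleep 20
  when: always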
Obtain API Token
Not all environments allow you to specify in advance which token should be used in
the automation hub, so a manual step has to be inserted here. If you were able to set it
during the installation, keep it the same as the token you specified on the gitlab group
for the config as code; in that case this step is no longer necessary and you can
continue fully automatically...
The big manual step...
Log in to the newly configured automation hub, generate a new token and commit it to the gitlab group for the Configuration As Code repositories. As a result, the new token will be used by all pipelines that are yet to be executed.
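How the new token reaches the config as code pipelines depends on how you have set this up. One possible approach, assuming the pipelines read it from a group-level CI/CD variable (the variable name AH_TOKEN and the variable new_ah_token are examples), is a task like this:

- name: Store the new automation hub token as a group-level CI/CD variable
  ansible.builtin.uri:
    url: "{{ gitlab_protocol }}{{ gitlab_url }}api/v4/groups/{{ gitlab_group }}/variables"
    method: POST
    validate_certs: false
    headers:
      Authorization: "Bearer {{ token }}"   # a gitlab API token, obtained as in the playbooks above
    body_format: form-urlencoded
    body:
      key: AH_TOKEN                         # example variable name
      value: "{{ new_ah_token }}"           # the token generated in the automation hub
      masked: "true"
    status_code: [201]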
Then we can continue with the playbook below to restore the controller configuration.
Recover Code AAP Configuration
Broadly speaking, this playbook does for the controller what the previous one did for
the automation hub. There is one small difference: this playbook needs very little
input, just where it can find the repositories that follow the naming convention and
the base configuration repository; the rest it figures out itself.

We're going to chop it up again and explain the 'magic' of each part. Unfortunately we
lose some of the magic along the way, but hopefully it becomes clear what exactly we
are doing. Knowing what it does also means being able to recover when things don't
go so well.

Because the organizations that have been configured cannot be retrieved until the
base configuration is loaded into the controller, this playbook is split into two phases.
First the base configuration has to be loaded.
recover_aap.yml:
---
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Get vars
      ansible.builtin.include_vars: env_vars.yml
    # Reconfigure controller from base config
    - name: "Create GitLab Project in group {{ gitlab_group }}"
      community.general.gitlab_project:
        api_url: "{{ gitlab_protocol }}{{ gitlab_url }}"
        validate_certs: false
        api_username: "{{ gitlab_user }}"
        api_password: "{{ gitlab_password }}"
        name: "{{ aap_project_name }}_base"
        group: "{{ gitlab_group }}"
        default_branch: "{{ gitlab_env_branch }}"
        shared_runners_enabled: true
        initialize_with_readme: true
        state: present
    - name: Clone the new gitlab repository
      ansible.builtin.git:
        repo: "{{ gitlab_protocol }}{{ gitlab_user }}:{{ gitlab_password }}@{{ gitlab_url }}{{ gitlab_group }}/{{ aap_project_name }}_base.git"
        dest: "/tmp/{{ aap_project_name }}_base"
        version: "{{ gitlab_env_branch }}"
        clone: true
        update: true
      environment:
        GIT_SSL_NO_VERIFY: true
As with the recovery of the automation hub, we create a temporary repository and clone it to a temporary directory. This allows us to add the files locally and then push them to gitlab.
    - name: Template the gitlab-ci.yml for the base configuration
      ansible.builtin.template:
        src: gitlab-ci_base.yml.j2
        dest: "/tmp/{{ aap_project_name }}_base/.gitlab-ci.yml"
        mode: '0644'
    - name: Push the updated GitLab repository
      ansible.builtin.shell: |
        git config --global user.name "{{ gitlab_user }}"
        git config --global user.email "{{ gitlab_user }}@example.com"
        git add --all
        git commit -m 'initial config'
        git -c http.sslVerify=false push origin "{{ gitlab_env_branch }}"
      args:
        chdir: "/tmp/{{ aap_project_name }}_base"
      changed_when: false
    - name: Delete the temporary directory
      ansible.builtin.file:
        path: "/tmp/{{ aap_project_name }}_base"
        state: absent
    - name: Sleep for 60 sec
      ansible.builtin.pause:
        seconds: 60
Above, the pipeline file that will trigger the base configuration as code project as a dependency is templated into the repository. After pushing the new file to the gitlab repository, the pipeline will be created. After this, we can delete the temporary folder, so that we leave everything tidy. We wait a little longer here to give gitlab a chance to create and launch the pipeline.
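The template gitlab-ci_base.yml.j2 itself is not reproduced here. A minimal sketch of what it could look like, assuming the base configuration lives in the aap_base_config project inside the gitlab_group group, is:

# gitlab-ci_base.yml.j2 (sketch)
stages:
  - base_config

base_config:
  stage: base_config
  trigger:
    project: {{ gitlab_group | lower }}/{{ aap_base_config }}
    branch: {{ gitlab_env_branch }}
    strategy: depend
  when: always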
    - name: GitLab Post | Obtain Access Token
      ansible.builtin.uri:
        url: "{{ gitlab_protocol }}{{ gitlab_url }}oauth/token"
        method: POST
        validate_certs: false
        body_format: json
        headers:
          Content-Type: application/json
        body: >
          {
            "grant_type": "password",
            "username": "{{ gitlab_user }}",
            "password": "{{ gitlab_password }}"
          }
      register: gitlab_access_token
    - name: Store the token in var
      ansible.builtin.set_fact:
        token: "{{ gitlab_access_token.json.access_token }}"
    - name: Check the pipeline until it has run
      ansible.builtin.uri:
        url: "{{ gitlab_protocol }}{{ gitlab_url }}api/v4/projects/{{ gitlab_group }}%2F{{ aap_project_name }}_base/pipelines"
        validate_certs: false
        headers:
          Authorization: "Bearer {{ token }}"
      register: _jobs_list
      until: _jobs_list.json[0].status == "success"
      retries: 120
      delay: 20
To make sure that the base configuration is loaded correctly before proceeding, we will monitor the pipeline. If it returns success, we can move on to the part to load the organizations.
- name: "Delete GitLab Project in group {{ gitlab_group }}"
community.general.gitlab_project:
api_url: "{{ gitlab_protocol }}{{ gitlab_url }}"
validate_certs: true
api_username: "{{ gitlab_user }}"
api_password: "{{ gitlab_password }}"
name: "{{ aap_project_name }}_base"
group: "{{ gitlab_group }}"
default_branch: "{{ gitlab_env_branch }}"
shared_runners_enabled: true
initialize_with_readme: true
state: absent
We delete the temporary project after it has run successfully. If it is still present, this part has not completed correctly and the second part will not have started either. In the case of the automation hub the play ended here, but we still have to load the configuration of all created organizations, which we do fully automatically from here on.
    # Recover all configured orgs
    # Read them from controller
    - name: "Create GitLab Project in group {{ gitlab_group }}"
      community.general.gitlab_project:
        api_url: "{{ gitlab_protocol }}{{ gitlab_url }}"
        validate_certs: false
        api_username: "{{ gitlab_user }}"
        api_password: "{{ gitlab_password }}"
        name: "{{ aap_project_name }}"
        group: "{{ gitlab_group }}"
        default_branch: "{{ gitlab_env_branch }}"
        shared_runners_enabled: true
        initialize_with_readme: true
        state: present
    - name: Clone the new gitlab repository
      ansible.builtin.git:
        repo: "{{ gitlab_protocol }}{{ gitlab_user }}:{{ gitlab_password }}@{{ gitlab_url }}{{ gitlab_group }}/{{ aap_project_name }}.git"
        dest: "/tmp/{{ aap_project_name }}"
        version: "{{ gitlab_env_branch }}"
        clone: true
        update: true
      environment:
        GIT_SSL_NO_VERIFY: true
We recreate the temporary repository and clone it to a temporary directory, so that we can add the files locally and then push them to gitlab.
    - name: Controller | Read the organizations list
      ansible.builtin.uri:
        url: "https://{{ controllers[gitlab_env_branch]['host'] }}/api/v2/organizations/"
        user: admin
        password: "{{ controllers[gitlab_env_branch]['passwd'] }}"
        method: GET
        body_format: json
        force_basic_auth: true
        validate_certs: false
      register: _controller_organizations
    - name: Get the list of dicts
      ansible.builtin.set_fact:
        organizations: "{{ _controller_organizations.json.results }}"
We log in to the controller, look up which organizations it contains and store the result in the variable organizations. With this list of configured organizations we perform the following tasks:
    - name: Template the gitlab-ci.yml for all organizations
      ansible.builtin.template:
        src: gitlab-ci.yml_org.yml.j2
        dest: "/tmp/{{ aap_project_name }}/.gitlab-ci.yml"
        mode: '0644'
    - name: Push the updated GitLab repository
      ansible.builtin.shell: |
        git config --global user.name "{{ gitlab_user }}"
        git config --global user.email "{{ gitlab_user }}@example.com"
        git add --all
        git commit -m 'initial config'
        git -c http.sslVerify=false push origin "{{ gitlab_env_branch }}"
      args:
        chdir: "/tmp/{{ aap_project_name }}"
      changed_when: false
With the list and the template, we create a new pipeline file in the repository and push it to gitlab, which builds and starts the new pipeline.
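The template gitlab-ci.yml_org.yml.j2 is not reproduced here either. A minimal sketch of what it could look like, assuming every organization has a config as code repository named after the org_project_name_prefix plus the organization name inside the gitlab_group group, is:

# gitlab-ci.yml_org.yml.j2 (sketch)
stages:
{% for org in organizations %}
  - {{ org.name | lower }}
{% endfor %}

{% for org in organizations %}
{{ org.name | lower }}:
  stage: {{ org.name | lower }}
  trigger:
    project: {{ gitlab_group | lower }}/{{ org_project_name_prefix }}{{ org.name | lower }}
    branch: {{ gitlab_env_branch }}
    strategy: depend
  when: always
{% endfor %}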
    - name: Delete the temporary directory
      ansible.builtin.file:
        path: "/tmp/{{ aap_project_name }}"
        state: absent
    - name: Sleep for 10 sec
      ansible.builtin.pause:
        seconds: 10
We remove the temporary local clone of the repository and wait a moment before continuing, to give gitlab time to create and launch the pipeline.
    - name: GitLab Post | Obtain Access Token
      ansible.builtin.uri:
        url: "{{ gitlab_protocol }}{{ gitlab_url }}oauth/token"
        method: POST
        validate_certs: false
        body_format: json
        headers:
          Content-Type: application/json
        body: >
          {
            "grant_type": "password",
            "username": "{{ gitlab_user }}",
            "password": "{{ gitlab_password }}"
          }
      register: gitlab_access_token
    - name: Store the token in var
      ansible.builtin.set_fact:
        token: "{{ gitlab_access_token.json.access_token }}"
    - name: Check the pipeline until it has run
      ansible.builtin.uri:
        url: "{{ gitlab_protocol }}{{ gitlab_url }}api/v4/projects/{{ gitlab_group }}%2F{{ aap_project_name }}/pipelines"
        validate_certs: false
        headers:
          Authorization: "Bearer {{ token }}"
      register: _jobs_list
      until: _jobs_list.json[0].status == "success"
      retries: 120
      delay: 20
We again wait and poll the pipeline until it reaches the correct state; if it returns success, everything has gone well and the configuration of all organizations has been restored.
- name: "Delete GitLab Project in group {{ gitlab_group }}"
community.general.gitlab_project:
api_url: "{{ gitlab_protocol }}{{ gitlab_url }}"
validate_certs: true
api_username: "{{ gitlab_user }}"
api_password: "{{ gitlab_password }}"
name: "{{ aap_project_name }}"
group: "{{ gitlab_group }}"
default_branch: "{{ gitlab_env_branch }}"
shared_runners_enabled: true
initialize_with_readme: true
state: absent
Again, we delete the temporary repository if everything went well. This is how we leave everything clean behind us.