Ansible in the Enterprise

Preface

In this document I will try to explain how you could implement ansible in an enterprise organization. According to many experts the implementation should be painless, but it usually is not.
Implementing ansible throughout the enterprise is a major change within the organization: the organization has to change together with the implementation, or the implementation will fail.
If you have no experience implementing ansible in an enterprise, get someone who does to help you.

What will I not explain?
- ansible best practices, these should be well known and are well documented
- the use of ansible itself
- awx or automation platform (this is a must in an enterprise)

How the implementation goes depends mainly on any management tooling that was implemented before, like puppet.

I will explain some principles for a successful implementation of ansible, and some pitfalls.

HandsOff-It

A term that is often heard in combination with ansible is "handsoff-it", but what exactly does this mean? There doesn't seem to be a good description of the process anywhere, so everyone is doing something, but nobody knows whether what they are doing actually qualifies.
I will try to explain this in simple terms.
Be aware that this is not easily achieved: it requires major design changes in architecture, way of working and standardization.
Deviation from standards must have consequences, like exclusion from network access. This enforces the use of standards in the enterprise.
These are hard decisions to make, but essential for getting control over the environment.

What is HandsOff-It?

Review the following statements; I will explain them further on:
- Once a system is deployed, user logins are no longer allowed
- All changes are made through ansible
- Major changes only through redeployment
- Nobody has sudo or root privileges
- For incidents, there is an escape user (not root) that must be enabled through ansible code
- Code compliancy is checked, not enforced
- Ansible login on systems is only possible from rhaap with key authentication and a passphrase

Once a system is deployed, logins in production are no longer allowed

To ensure that there will be no unauthorized changes to running systems, users must not be able to log in to them; the systems should not even be configured with LDAP or AD. When an application needs LDAP or AD authentication, the application must handle this itself.
This ensures that no user can make unauthorized changes to the system, so it stays in a managed state.
This will be the hardest part to get into your organization, as access to systems is traditionally granted to users and engineers.

All changes are made through ansible

If something needs to be changed on the system, it has to be changed through an ansible playbook.
If this is a permanent change, it has to be implemented in a new version of the playbook that was used to deploy the system. This way the change is ensured for future deployments of the system.

If the change is a major one, read below.

Major changes only through redeployment

If major changes are made to the code that will change the behaviour of the system, they must be applied through a redeployment of the system and the application. This ensures that the system is fully compliant and that no old configuration items are left behind.
In testing you can check whether the application works with the new configuration, so you know it will also work in production. There are just a few exceptions to this rule (pets & cattle):
- The data on the server is too big (database servers)
- The server is not redundant (you have work to do)

Nobody has sudo or root privileges

There is no excuse for sudo or root permissions in acceptance or production.
In development it is more or less mandatory to have these permissions, or there can be no development.
In the testing environment sudo permissions can still be used (if needed), but they are no longer passwordless: they are protected by a daily changing password on the testing systems.
The password can be requested at the automation desk.
This is not a mandatory step, but it makes developers aware that this environment has stricter change requirements than the development environment, and that they have to ensure that the changes they make end up in their code.

For incidents, there is an escape user (not root) that must be enabled through ansible code

We cannot ensure that a system will never fail, so we need to be able to log in with enough rights to analyse a problem that we haven't fixed in code yet.
Therefore we add a local account with sudo ALL privileges to every system.
This account is locked out by default and its password is unknown (random). To be able to use this account, the service desk needs to run a job-template in rhaap, which unlocks the account and sets a known password that the service desk has entered. The playbook that runs not only unlocks the account, but also schedules a job to lock the account again after a given amount of time.
This prevents passwords from becoming known within the organization and being misused.
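
To make the lifecycle of the escape account concrete, here is a minimal sketch in Python. It only models the behaviour described above (locked by default, unlocked for a limited window, locked again by a scheduled job); it is not the playbook itself. The class name, the account name "escape" and the length of the window are my own assumptions.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class EscapeAccount:
    """Hypothetical model of the local escape account on one system."""
    name: str = "escape"              # assumed account name
    locked: bool = True               # locked out by default
    relock_at: Optional[datetime] = None

    def unlock(self, now: datetime, window: timedelta) -> None:
        """Unlock for an incident and record when the account must be locked again."""
        self.locked = False
        self.relock_at = now + window

    def enforce(self, now: datetime) -> None:
        """What the scheduled job does: lock the account once the window has passed."""
        if not self.locked and self.relock_at is not None and now >= self.relock_at:
            self.locked = True
            self.relock_at = None


def random_password(nbytes: int = 32) -> str:
    """A password nobody knows, to be set whenever the account is locked."""
    return secrets.token_urlsafe(nbytes)
```

With a window of four hours, calling enforce() one hour after unlock() leaves the account usable, and calling it at or after the four hour mark locks it again. In a real playbook the same three steps would be: set the password entered by the service desk, unlock the account, and schedule the job that locks it and resets the password to a random value.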

Code compliancy is checked, not enforced

As in every code-driven environment, the maturity of the code and systems grows as time passes and development goes on. To prevent systems from ageing beyond repair, we have to check whether the running systems are still within the running parameters.
We will not update these systems with the latest version of the code, as this could damage the running application or corrupt data. What we need to do is check whether the system is still compliant with the latest version of the deployment code. If compliancy is not guaranteed, we redeploy the system with the latest production release of the ansible code and the application.
The check should be run at least every week or every patch cycle for each system, so you know the status of every machine that is controlled through ansible.
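
The decision that follows from such a check can be sketched as below. This is only an illustration in Python under my own assumptions: the CheckResult structure, the field names and the seven day limit are hypothetical, and the way you collect the results (for example from a check-only run of the deployment code) is up to you.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one check-only compliance run against a host (hypothetical structure)."""
    host: str
    checked_on: date
    drifted_items: int  # settings that differ from the latest deployment code


def overdue(results: Iterable[CheckResult], today: date,
            max_age: timedelta = timedelta(days=7)) -> List[str]:
    """Hosts whose last check is older than the agreed interval."""
    return sorted(r.host for r in results if today - r.checked_on > max_age)


def redeploy_candidates(results: Iterable[CheckResult]) -> List[str]:
    """Hosts that are no longer compliant; these are planned for redeployment, not patched in place."""
    return sorted(r.host for r in results if r.drifted_items > 0)
```

Note that nothing in this sketch changes a system: it only produces two lists, one of hosts that need a fresh check and one of hosts that should be scheduled for redeployment.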

Ansible login on systems is only possible from rhaap with key authentication and a passphrase

In the acceptance and production environments it should only be possible for ansible to log in to systems from the rhaap platform. All other source systems must be denied. This prevents playbook runs from other systems, which could make unauthorized changes to the managed systems without being recorded in central logging.
In the development environment, access from other systems remains essential for developing the playbooks for production.
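
One way to restrict the source of ansible logins is the from= option that OpenSSH supports in authorized_keys entries. The sketch below, in Python, only builds such an entry; the function name is my own, and the addresses and key in the usage note are placeholders. It is one possible approach, to be combined with whatever network filtering your organization already has.

```python
from typing import Sequence


def restricted_key_line(public_key: str, allowed_sources: Sequence[str]) -> str:
    """Build an authorized_keys entry that is only accepted from the given sources.

    allowed_sources are host or address patterns for the OpenSSH from= option,
    for example the addresses of the rhaap execution nodes (an assumption).
    """
    if not allowed_sources:
        raise ValueError("at least one allowed source is required")
    for source in allowed_sources:
        # Quotes, commas and whitespace would break the option syntax.
        if any(ch in source for ch in '", \t\n'):
            raise ValueError(f"invalid source pattern: {source!r}")
    pattern_list = ",".join(allowed_sources)
    return f'from="{pattern_list}" {public_key.strip()}'
```

Called with a public key and the sources 192.0.2.10 and 192.0.2.11, this returns the key prefixed with from="192.0.2.10,192.0.2.11". The passphrase on the private key is handled on the rhaap side, in the credential, and is not part of this entry.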

Do not copy puppet

There is a big difference in philosophy between ansible and puppet; both have their specific applications.
Choose one or the other, and once you (or the management) have made your choice, stick to it and don't mix the functionality.
If you do, it will bite you...

When the choice is made for ansible, this means that you have chosen a deployment tool that is meant to set the initial state of the system. After that, you don't touch the system by logging in with privileged accounts and changing things by hand.
You want to be sure that a fresh deployment of the same system will result in the same functional server/application. The way to reach that goal is to let ansible manage the complete system, without making manual changes.
Whereas puppet would roll back a manual change in a matter of hours, ansible will not, because there is no agent that runs the code every hour or so.

Do not schedule a job_template every few hours to update the system; just disable access to acceptance and production servers for all users.

System Responsibility

As application developers will be making not only the application, but also the code to deploy and manage the application, responsibility for system uptime will shift accordingly. In the old days, the linux team would deploy a system and install the application software, guided by an installation manual written by the developers of the application.
Now that the developers are also creating the deployment code to install and run the application, things will change. The linux team will have no knowledge of the application at all and thus cannot take responsibility for running the application.
So the development team is now responsible for the application and for all changes they make to the standard system provided by the linux team's ansible code.
The application developers will deploy their own systems completely, using the ansible code provided by the OS (linux/windows) teams. The responsibility for the system therefore lies with the development team of the application, and no longer with the OS team for hundreds of systems.
The OS team can be consulted in case of incidents or problems, but the first response comes from the application team.
This is a major shift in responsibilities for many organizations.

Layers in ansible

You could write a playbook that configures a system in its totality. It would be an enormous playbook with a lot of tasks deploying and configuring the machine, the OS and after that the application.
As you will know, this is not a best practice.
So how do we build a system in a less complex way?
By breaking the whole process into small pieces and gluing them together in the automation platform.

So when we take a normal system deployment and break it into pieces, it looks like this:
- create network resources (if applicable)
- create VM using template
- register VM, update and configure OS
- install middleware (if needed)
- install application

The networking playbook that will create the networking resources can be driven by variables like:
- Location of the server (datacenter or cloud provider)
- ip range
- vlan number

The infrastructure playbook that creates the VM can be driven by variables that:
- determine the location of the VM (local or in a cloud)
- determine the sizing of the storage,cpu and memory
- what OS will be deployed

The next step will configure the chosen OS for use within the organization:
- logon information (set the AD/LDAP configuration)
- time service
- any other server configuration your organization needs or wants

The next step will install middleware (if needed for the application):
- install the packages
- configure the middleware to run

The last step installs the application and starts it:
- install the application package
- configure application services
- create directories
- register the service

These 5 layers can be built in parts and tested.
They can be joined in workflows and/or pipelines for fully functional system deployments.
This way the deployment process is flexible and tunable for the enterprise.
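
The idea of gluing the layers together can be sketched in a few lines of Python. This is not how automation platform implements workflows; it only shows the principle of a fixed order, optional layers and stopping at the first failure. The layer names and both function names are my own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Layer:
    name: str
    optional: bool = False


# The five layers from the text, in deployment order (names are my own).
LAYERS: List[Layer] = [
    Layer("network", optional=True),     # only if applicable
    Layer("vm"),
    Layer("os"),
    Layer("middleware", optional=True),  # only if the application needs it
    Layer("application"),
]


def plan_workflow(skip: FrozenSet[str] = frozenset()) -> List[str]:
    """Return the layers to run, in order, leaving out the optional ones in skip."""
    unknown = set(skip) - {layer.name for layer in LAYERS}
    if unknown:
        raise ValueError(f"unknown layers: {sorted(unknown)}")
    plan = []
    for layer in LAYERS:
        if layer.name in skip:
            if not layer.optional:
                raise ValueError(f"layer {layer.name!r} cannot be skipped")
            continue
        plan.append(layer.name)
    return plan


def run_workflow(plan: List[str], runners: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each layer in order and stop at the first failure; return the completed layers."""
    completed = []
    for name in plan:
        if not runners[name]():
            break
        completed.append(name)
    return completed
```

Skipping "network" and "middleware" gives the plan vm, os, application. Because each layer is a separate unit with its own runner, it can be developed and tested on its own before it is joined into the workflow.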

This is not easily achieved. Start small with the infrastructure deployment on a local cloud (vmware or something like that) and get a feeling for the process. Then start configuring that system to enterprise needs.
Once you can deploy these 2 layers in a workflow to create a functional system, you are ready to deploy applications on these systems.

Test everything

Ensure that everything is tested, both code compliance and functionality:
- Set up your pipeline to ensure code is linted before you can merge it to a branch in Git
- Ensure code review is mandatory in Git
- Test your code (and its results) in a test environment, scheduled through rhaap
- Divide testing into destructive tests and non-destructive tests
- Release only tested code to acceptance and production

Compliancy Checks vs Hardcoded updates

When you have your test playbooks set up well, you can run these in production (only the non-destructive tests) to check whether your production still meets the compliancy rules.
With the results, you can plan which systems in production are to be updated or redeployed.
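
Selecting which tests may run where is simple once every test is marked as destructive or not. A minimal sketch in Python, with hypothetical names; I assume here that destructive tests are only allowed in an environment called "test", which is my reading of the text and not a hard rule.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Check:
    """One test playbook, marked as destructive or not (hypothetical structure)."""
    name: str
    destructive: bool


def checks_for(environment: str, checks: Iterable[Check]) -> List[str]:
    """Names of the checks that may run in the given environment.

    Assumption: destructive checks run only in the "test" environment;
    every other environment, including production, gets the non-destructive ones.
    """
    allow_destructive = environment == "test"
    return [c.name for c in checks if allow_destructive or not c.destructive]
```

The important design choice is that the marking lives with the test itself, so the same set of test playbooks can be used for testing a release and for the compliancy checks in production.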

Automate everything

This is the most important of all: automate deployments, testing and updates to systems.
It takes a lot of effort, but will help you in the end.

Why try to find an error when redeployment is faster? (Only for stateless systems at first.) Try to design stateless systems as a commodity and optimize the deployments for speed.

Authors and acknowledgment

Author: Wilco Folkers  
        Senior Automation Specialist  
Date:   12-06-2025