Before you start designing an application architecture for any cloud, you need to consider the common quality attributes of cloud systems:
- Scalability is the capability to adjust system capacity based on current needs. For example, let’s say you’re developing an internet shop. You know that before Christmas, the number of orders will grow significantly and you will need additional resources in order to handle every request. At the same time, during periods of normal operation, you don’t need as many resources.
- Availability is the proportion of time when the system is functional and working.
- Monitoring. In general, you need to think of the cloud as a remote data center. At the same time, you need to take into account that you may have a great number of servers and also consider the dynamic nature of the cloud.
- Security is the capability of the system to prevent data loss, information leaks, and unauthorized usage.
- Cost. You have unlimited access to compute, storage and network resources and pay as you go, and every resource has its own cost. If you don’t use these resources wisely, you may pay a high price.
- Time-to-market is the time that is required to deliver your service to customers.
All these challenges are input for your architecture design. In your architecture, you need to provide answers on how you’re going to respond to these challenges. In order to do this, you need to apply architecture tactics that give you a vision on how you’re going to achieve defined system quality attributes.
The key driver of architecture decisions, in most cases, is time-to-market. Business always dictates when the system needs to go live in order to achieve an organization’s goals. To achieve time-to-market, you need to consider the following things:
- Build versus buy. It is always attractive for software engineers to create something cool and brand new from scratch. However, it can require a significant amount of effort, and there is no guarantee that your solution will be successful. Before you start designing a solution, you need to check if something ready to use is available on the market. If there is an option that works for you, always choose the ready-to-use solution. This decision can significantly reduce implementation effort and often the solution cost.
Here’s a simple example: your solution requires a MySQL database, and you decide to deploy it in the cloud. Let’s assume that it is AWS. If you run your own MySQL server on Infrastructure as a Service (IaaS), you need to spend time on deployment and configuration scripts, and you need to worry about server availability, security, backups and disaster recovery. Alternatively, you may easily spin up a new RDS DB instance with a MySQL engine. Yes, it will be preconfigured by Amazon and may have some limitations, but you spend five minutes deploying a MySQL server instead of several days developing deployment scripts.
- Development environment. If your development team can’t use an environment that’s pretty close to the production configuration, it is a road to a great number of defects, low development velocity and an unpredictable delivery date. When you design a solution, you need to choose a technology stack and services that each developer in your team may use to work on tasks. Developers need to have isolated environments to work independently. As a fast solution, you may give unlimited access to the cloud to the development team. This approach works fine, however there is a negative side—cloud resources are expensive. The infrastructure as code approach is the best option.
Your development team may use Vagrant or Docker in order to have the ability to run the system locally on a workstation. You need to carefully choose which of your system components can run locally and which can be used in the cloud only. For example, if we’re talking about a database, you have options to choose between DynamoDB and MySQL (for instance). DynamoDB is great. However, you can’t deploy it locally, and you need to think about how to provide it in isolated development environments. MySQL is well known by software engineers, and it can be deployed locally. Which is the best? There is no right answer to this question, because you need to choose the one that is best for you and your team, taking into account your goals, team expertise, etc.
There are two ways to manage your system capacity:
- You may add more CPU, memory and storage to a server—vertical scalability.
- You may add more servers if you need more processing power or reduce the number of servers when you don’t have a need to keep a full fleet of servers. This approach is called horizontal scalability.
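The horizontal approach can be sketched as a simple capacity policy: compute how many servers the current load needs, bounded by a minimum and maximum fleet size. This is a minimal illustrative sketch; the per-server throughput and the fleet bounds are assumptions, not values from the text.

```python
import math

def desired_fleet_size(requests_per_second: float,
                       requests_per_server: float = 100.0,
                       min_servers: int = 2,
                       max_servers: int = 20) -> int:
    """Return how many servers the fleet needs for the current load.

    requests_per_server, min_servers and max_servers are illustrative
    assumptions; in practice they come from load testing and budget.
    """
    needed = math.ceil(requests_per_second / requests_per_server)
    # Never drop below the redundancy floor or exceed the cost ceiling.
    return max(min_servers, min(max_servers, needed))
```

In a real deployment this policy would be expressed as an autoscaling rule in your cloud provider rather than application code, but the decision logic is the same.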
In order to achieve scalability and elasticity of your infrastructure, follow these recommendations:
- Don’t think about servers as static resources. Servers in the cloud can be easily terminated, replaced or added. Recovering a broken server in the cloud is often more expensive than replacing it. If you have the right design and implementation, you can terminate a server and replace it with a new one. In order to do this, you have the following options:
- Golden images. You may create server images that have an installation of the specific version of your system. They can be virtual machine or Docker images.
- Bootstrap scripts. When a new virtual machine starts, it runs a bootstrap script that installs all required configuration on the machine. For this approach, you may use Chef, Puppet, Ansible or even bash scripts.
- If you want to achieve horizontal scalability, you need to design stateless components.
- If for some reason you have a component that has to keep a state, use a vertical scalability approach. A good example of a component that keeps state and can be vertically scaled is a MySQL database server.
- Vertical scalability may have limits too. For example, you may give your application a server with multiple CPUs and several gigabytes of RAM; however, your application needs to be able to consume all of these resources.
- Design a system with loose coupling components and use messaging for communication between components.
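Loose coupling through messaging can be sketched with an in-process queue: the producer only knows about the queue, not about any consumer, so either side can be scaled or replaced independently. This is a toy sketch; a real system would use a broker such as SQS, RabbitMQ or Kafka instead of `queue.Queue`, and the order-processing logic here is a placeholder.

```python
import queue

def produce_orders(q: queue.Queue, orders):
    """Publish orders without any knowledge of who consumes them."""
    for order in orders:
        q.put(order)  # fire-and-forget: decoupled from consumers

def consume_orders(q: queue.Queue):
    """Drain the queue statelessly; any worker instance can run this."""
    processed = []
    while not q.empty():
        order = q.get()
        processed.append(f"processed:{order}")  # placeholder handling
        q.task_done()
    return processed
```

Because consumers hold no state between messages, you can add or remove worker instances freely, which is exactly what horizontal scaling requires.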
The general strategy for an architecture design for the cloud is to design for failure. You need to consider that cloud services and third-party services can sometimes be unavailable. In case of a hardware failure, the cloud provider may terminate your instance in order to move it to another server rack without any notification. General tactics to achieve high availability are:
- Reduce single points of failure. You need to provision a redundant number of servers.
- Distribute your servers and services between different geographic locations.
- Use messaging for communication between components. It can guarantee that you will not lose a message when a component is unavailable.
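In application code, designing for failure often means treating any remote call as something that can fail transiently and retrying it with backoff instead of crashing. A minimal sketch, assuming a hypothetical flaky operation and illustrative delay values:

```python
import time

def with_retries(operation, attempts=4, base_delay=0.01):
    """Call operation, retrying transient failures with exponential backoff.

    attempts and base_delay are illustrative; tune them per dependency.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
```

Combined with messaging, this keeps a temporary outage of one component from cascading through the whole system.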
If you have just one server, or even a few servers, you may get system metrics, check application status or check logs manually. However, what do you do if you have hundreds of servers that can be added or replaced at any point in time?
- Choose a system monitoring solution. Oftentimes, cloud providers supply some basic functionality out of the box. However, if that is not enough, you may choose between the different monitoring solutions available on the market.
- Aggregate logs and store them on reliable storage. To perform troubleshooting and incident investigation in an elastic environment, you need to publish logs to a remote server and have the ability to search them. If you want to build your own log aggregation solution, Elastic is the de facto standard: you can choose to deploy the native ELK stack or opt for Graylog. If you prefer to use SaaS, you may choose between Sumo Logic, Loggly or any other service available on the market.
- Implement health checks endpoints for each system component. The health check endpoint should return a status of OK or Failure. If the service uses external dependencies like a database or remote services, it is a good idea to show a dependency status.
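A health check endpoint of this kind can be sketched with the Python standard library alone. This is a minimal illustration: `check_database` is a placeholder for a real dependency probe, and a production service would use its actual web framework instead of `http.server`.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database() -> bool:
    """Placeholder dependency check; a real service would ping its DB here."""
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        db_ok = check_database()
        body = json.dumps({
            "status": "OK" if db_ok else "Failure",
            "dependencies": {"database": "OK" if db_ok else "Failure"},
        }).encode()
        # 503 tells load balancers to stop routing traffic here.
        self.send_response(200 if db_ok else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

def serve_health(port: int = 8099) -> HTTPServer:
    """Start the health endpoint on a background thread and return the server."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Returning the dependency statuses in the body, not just an overall OK/Failure, makes it much faster to see which downstream system is the culprit during an incident.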
Security is always important. First, you need to decide how you’re going to manage access to the cloud. Each cloud provider supplies their own solution for Identity and Account Management (IAM). The solution gives you the ability to configure granular access to cloud resources. Then, you may apply the following recommendations to your solution:
- Encrypt communication between service components.
- Encrypt all sensitive information.
- Grant the minimum required access level to the cloud services for your application.
- Design a solution to rotate encryption keys and credentials.
- Use virtual private networks and expose only public endpoints to the internet.
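The key-and-credential rotation point can be sketched as a versioned store: each rotation issues a fresh secret under a new version while older versions stay readable until clients migrate. This is an illustrative in-memory sketch; in practice you would use a managed secrets service (for example AWS Secrets Manager or Vault) rather than application code.

```python
import secrets

class CredentialStore:
    """Toy versioned credential store illustrating rotation."""

    def __init__(self):
        self._versions = {}
        self.current_version = 0

    def rotate(self) -> int:
        """Issue a new secret under a new version and return that version."""
        self.current_version += 1
        self._versions[self.current_version] = secrets.token_urlsafe(32)
        return self.current_version

    def get(self, version: int) -> str:
        """Old versions remain readable until all clients have migrated."""
        return self._versions[version]
```

Keeping the previous version available during a rollout is what lets you rotate without downtime: new deployments pick up the new version while in-flight requests still validate against the old one.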
- Always check cloud pricing and available discounts to achieve cost optimization.
- Horizontal scalability is a powerful tool for cost optimization. However, it is not a silver bullet. In some cases, it makes sense to choose a bigger server. For example, you might do this if you get a discount for paying in advance for a larger server. You may also have a hybrid solution, where one big server handles the average load and smaller ones are added when you need to increase system capacity.
- Shut down unused resources. If you know that a minimum number of users access the system at night, you may shut down and spin up servers by schedule or any other metric, available in the cloud.
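The shutdown-by-schedule idea can be sketched as a rule that keeps only a skeleton fleet during a quiet window. The window boundaries and fleet sizes below are illustrative assumptions; in practice this maps to a scheduled autoscaling action in your cloud provider.

```python
def servers_for_hour(hour: int, day_fleet: int = 10, night_fleet: int = 2) -> int:
    """Return the fleet size for a given hour of the day (0-23).

    The 23:00-07:00 quiet window and both fleet sizes are illustrative.
    """
    is_night = hour < 7 or hour >= 23
    return night_fleet if is_night else day_fleet
```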
Benchmark cloud platforms
After you’ve defined the quality attributes of your future or legacy system, the next step is to choose the cloud provider that best fits your needs. Yes, you should test a cloud provider and check if it can meet the quality attributes you defined. A couple of standard areas you need to consider are:
- Presence in different regions. It is always a good idea to deploy your infrastructure as close to consumers as possible. Also, if you worry about the high availability of your service, you definitely need to deploy the infrastructure across multiple sites.
- Check if they have accelerators to help you deploy and deliver your solution faster. These can be platform as a service offerings, managed services, data migration services, and even virtual machines of the right specification. If you can use a managed service provided out of the box, you can save on operations effort.
- Network performance. If your solution requires a specific networking performance, always perform a benchmark of the cloud network. Each cloud provider has their own vision on what network performance and quality is the best for their target audience.
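A network benchmark at its simplest measures round-trip latency. The sketch below demonstrates the technique against a local TCP echo server, so the numbers it produces are only loopback latency; to compare cloud providers you would point the client at a real endpoint in the target region and use a proper tool (iperf3, or the provider's own benchmarks) for throughput.

```python
import socket
import statistics
import threading
import time

def measure_latency_ms(port: int = 8123, rounds: int = 20) -> float:
    """Median round-trip time in ms against a local echo server (demo only)."""
    srv = socket.create_server(("127.0.0.1", port))  # listen before connecting

    def echo():
        conn, _ = srv.accept()
        with conn:
            for _ in range(rounds):
                conn.sendall(conn.recv(16))  # echo each small ping back

    server = threading.Thread(target=echo)
    server.start()
    samples = []
    with socket.create_connection(("127.0.0.1", port)) as sock:
        for _ in range(rounds):
            start = time.perf_counter()
            sock.sendall(b"ping")
            sock.recv(16)
            samples.append((time.perf_counter() - start) * 1000)
    server.join()
    srv.close()
    return statistics.median(samples)
```

Taking the median rather than the mean keeps a single slow outlier from skewing the comparison between providers.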
Use DevOps approach and implement an infrastructure as code
In a traditional data center, in most cases, you work with a fixed number of servers, so adding a new machine or decommissioning an existing one is an expensive and slow process. On the other hand, the server already exists, and you don’t need to worry about deploying it from scratch. In the cloud world, a server is a compute unit that can be easily deployed or terminated at any given moment. Conducting manual operations in the cloud, like server configuration and deployment, is a road to disaster. At the same time, developing directly in the cloud is very expensive, and if your development environment differs from the target environment, you may face challenges during deployment. Challenges are not fun, so you’d better start the whole process by building the right development environment for the development and operations teams at the initial stages. You can evaluate the current state of the project with the help of a project maturity model. By the way, all team members need the access and the ability to run a copy of the development environment locally or in the cloud. In case anything gets broken, an engineer should be able to rebuild the environment from scratch.
To achieve these goals consider the following tools:
- Vagrant is, in fact, an abstraction over different virtualization and cloud platforms. You can describe your server or even cluster configuration as code. Vagrant enables you to use different provisioning tools, such as plain scripts, Chef, Puppet or Ansible, to bootstrap a virtual machine. You can use Vagrant as a playground for cluster configuration testing, as well as for testing the application deployment. This way, software engineers can test an application in an environment that is pretty close to the target configuration.
- Using provisioning tools (Chef, Puppet, Ansible or even plain bash scripts) is much better than manual deployment.
- Docker is another great tool for infrastructure as code, especially if you’re going to use containerization in the cloud. With the help of Docker and Docker Compose, you can emulate the target infrastructure locally and deploy only tested container images to the cloud.
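As an illustration of the Docker Compose approach, here is a hypothetical compose file that emulates a small target infrastructure locally: an application container next to a MySQL container standing in for the managed database. The service names, image tags, ports and the password value are all illustrative assumptions.

```yaml
version: "3.8"

services:
  app:
    build: .                  # your application image, built from the local Dockerfile
    ports:
      - "8080:8080"
    environment:
      DB_HOST: db             # the app reaches the database by service name
    depends_on:
      - db

  db:
    image: mysql:8.0          # same engine as the managed database in the cloud
    environment:
      MYSQL_ROOT_PASSWORD: example   # illustrative only; never commit real secrets
```

The same images that pass testing against this local stack are the ones you then deploy to the cloud, which is what keeps development and target environments close.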
As a summary, I’d like to stress that the overall approach to architecture design for the cloud is the same as for any other software. As usual, you need to start with defining the target quality attributes that describe your expectations, then apply one of the tactics mentioned above. The difference between designing for the cloud and for on-premises is that in the cloud, infrastructure is also software. It means that you need to design your cloud solution as if software and infrastructure are one single piece of a whole. To do this, you will need to find tradeoffs between software design and infrastructure. However, every application and business case is unique, so the tradeoffs will have to be carefully considered to find harmony between the target quality attributes, design, and business needs.