Creating a good Docker image is an art. There are no fixed rules that can be applied in every situation. Instead, we need to look at the pros and cons of every decision. We can however provide guidance.
Here are 7 golden rules for Docker images. By following these rules, you can improve the containers you build, making them more reusable, more efficient and more stable.
The prime requirement for all scalable containers is to never keep track of state. Every action should be executed in its own context, without the need to store long-term information anywhere inside the micro-service. This means that permanent storage, like databases, data files or caches, should not live inside the container. When no data lives inside the container, requests can be executed by any copy of the micro-service, so we can load-balance the requests or recycle malfunctioning containers.
Statelessness is hard to achieve. It requires that the software you try to encapsulate is written in a way that allows statelessness. When done properly, the software should allow you to push the state to an external resource, such as a database. As a hint, you could try to put your data files in central folders that can be mounted as external shared volumes (watch out for file locks and concurrency on the files), make use of external, shared caches, etc.
If the software doesn’t allow a stateless runtime, you might be able to use clustering features of the software. This is not advised, because it puts extra requirements on the network setup of your runtime, but it can allow you to run load-balanced.
My request should be handled by any container, independent of previous requests.
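A minimal sketch of this idea in Python, assuming a hypothetical shopping-cart service. The handler keeps nothing in process memory; all state goes through an external store that is passed in. `ExternalStore` is only an in-memory stand-in for a real database or shared cache, and all names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class ExternalStore:
    """Stand-in for an external database or shared cache (hypothetical)."""
    data: dict = field(default_factory=dict)

    def get(self, key, default=None):
        return self.data.get(key, default)

    def put(self, key, value):
        self.data[key] = value


def handle_add_item(store: ExternalStore, cart_id: str, item: str) -> list:
    """Stateless handler: reads and writes state only via the external store."""
    items = list(store.get(cart_id, []))
    items.append(item)
    store.put(cart_id, items)
    return items


# Two "containers" are just two calls to the same function: neither keeps
# state of its own, so either one can serve any request.
shared = ExternalStore()
handle_add_item(shared, "cart-1", "book")          # handled by container A
result = handle_add_item(shared, "cart-1", "pen")  # handled by container B
```

Because the handler receives the store as an argument, swapping the stand-in for a real external resource does not change the handler itself.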
When your container travels through the development landscape towards production, it is downloaded and uploaded many times. Even when it has hit production, it will be copied and unpacked every time a new instance is started. Even though disk, memory and CPU might be cheap, the total can add up. Look at the following examples:
- When a micro-service fails, we need a replacement instantly, so any delay during unpacking should be avoided.
- After a disaster, all services might need to start at the same time, congesting the bandwidth for downloading images and the other resources needed to unpack them.
- Your artifact repository might hold hundreds of versions of your images. Even with good housekeeping rules, the disk space used might grow beyond what is available.
What can you do about this?
- Clean up your package repository cache after using it to install. Package managers like yum and apt download a copy of the version information of all available packages. You won’t need that inside your running container, so clean up after installing.
- Make sure you clean up at the end of every line in your Dockerfile. Docker creates a filesystem snapshot after every line and stores the diff as a layer. Cleaning up on the next line will not reduce space, but will instead use more space.
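The effect of cleaning up on a later line can be illustrated with a toy model. It assumes a simplified view in which the image size is the sum of what each layer stores, and in which deleting a file in a later layer only hides it. The file names and sizes are made up for illustration.

```python
def image_size(layers):
    """Sum the megabytes stored by each layer.

    Each layer is a dict mapping a file name to its size in MB, or to None
    when the layer deletes that file. A deletion hides the file but does not
    remove the bytes stored by an earlier layer.
    """
    total = 0
    for layer in layers:
        total += sum(size for size in layer.values() if size is not None)
    return total


# Install and clean up on separate lines: the cache stays in layer 1.
separate = [
    {"app-package": 40, "package-cache": 60},  # install
    {"package-cache": None},                   # clean up on the next line
]

# Install and clean up on the same line: the cache never reaches a layer.
combined = [
    {"app-package": 40},                       # install and clean up together
]

separate_mb = image_size(separate)
combined_mb = image_size(combined)
```

In this model the two-line variant stores 100 MB and the one-line variant 40 MB, even though both end up with the same visible files.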
Load-balancing and high-availability depend on the ability to react to changing conditions in a timely manner. If a container fails, you wish to have it replaced now, not one hour later. When sales are peaking on your website, you wish to have extra capacity now, without any lead time, or you might miss some revenue.
Your micro-service should be able to start fast, and without dependencies on external systems. You don’t want your service to download extra packages from a repository system at startup; instead, all the packages should be part of the Docker image. A service should not have to register itself on a central server: other services should be able to connect to your service by using a well-known URL or name, such as an OpenShift service name. There should never be an external license server that needs to authorize your instance and that becomes a show-stopper if it is unavailable.
Configuration is all about being able to change behavior without rebuilding the image
Think about how others will use your micro-service. What flexibility can we give them? Does your container need a server name and port to access an external database, or can we provide a full JNDI URL, which allows more fine-tuning? Try not to restrict the use of your container. Containers are for IT experts, not for end-users, so give them a powerful interface. It will be used by competent people who are trying to make it work in a situation you might not have foreseen, so try to give them the tools.
There are multiple ways to inject configuration into your container.
- Environment variables are the easiest to use. They are clear, easy to find and well understood. They are however immutable once the container has started. The process that runs inside the container gets a copy at startup.
- Configuration files are a bit more complex. The format depends on your software, and they may be scattered all over your file system. However, good software packages are able to detect changes at runtime and can reload the file. Also, files can be mounted, so multiple containers can use the same central configuration file, which makes it easier to maintain the settings.
Other solutions, such as storing settings in a central database, are possible, but not advisable in a micro-service landscape. If you are running a large OpenShift or Kubernetes environment, you want config changes to be visible, and a database hides them.
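As a small illustration of the environment-variable approach, here is a sketch in Python. The variable names `DB_URL` and `LOG_LEVEL`, and the default values, are assumptions chosen for this example and do not come from any particular product.

```python
import os


def load_config(env=os.environ):
    """Read settings from environment variables, with defaults.

    DB_URL and LOG_LEVEL are hypothetical names chosen for this example.
    """
    return {
        "db_url": env.get("DB_URL", "jdbc:postgresql://localhost:5432/app"),
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }


# Passing a plain dict makes the behaviour easy to try out without
# touching the real process environment.
config = load_config({"DB_URL": "jdbc:postgresql://db.internal:5432/orders"})
```

Accepting one full URL instead of a separate host and port leaves the fine-tuning to whoever deploys the container, which is the kind of flexibility described above.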
When configuration is all about changing the runtime behavior of the container within the bounds of the software, extendability is about being able to enhance software behavior by building on top of the image.
Many container images on docker.io are built according to this principle. When you use an image of an application server such as GlassFish, you can choose to start the container and upload your application modules to the running instance. This is, however, a painful process that needs to be repeated every time you start the container. Instead, you can choose to build a new Docker image on top of the application server image with the packages pre-installed. When you do so, you have extended the original micro-service with your code, creating a new micro-service.
When you design your micro-service, think about how others can extend it. Can they add classes, libraries or other things that enhance the behavior? Maybe you should split your container into two parts: a reusable base image and a customized extension to that image for your single purpose.
Docker images are built in layers. A layer is basically a diff: we take the previous layer and apply a number of changes to it, in order to arrive at a new situation. Each layer adds a piece of the final image, and each layer depends on the previous layer.
We already mentioned that we should avoid adding unnecessary data to the layers, but even when we avoid that, we still need to look critically at the layer structure. When you create a docker image, you are focused on the end result: making it work. This is your primary goal. Once you have it working, you should review your Dockerfile, and see if you need to make changes.
Docker tries to be smart about the layer structure. Whenever you use a Docker image, the layers are downloaded and cached locally. A cached layer can be re-used when 1) the chain of layers from the root layer up to this layer is exactly the same and 2) this layer itself has not changed. As soon as you make a change to one layer, it invalidates that layer and all layers that come after it. All these layers will need to be downloaded again, even though their previous versions were cached and the later layers themselves have not changed.
When you look at the order of the layers, you should follow the following guideline:
large before small, stable before volatile
When the order of two lines in your Dockerfile is not defined by any dependencies, you should consider the above rule. Ideally, the line that contributes the largest layer should come first. Also, lines that are not subject to change in a next version should come before lines that will change, such as lines referring to a specific version of a package. These two rules allow Docker to use its cache more effectively, significantly reducing the use of memory, disk space and bandwidth. As a bonus, the build time for images is also reduced any time you make a change to one of the volatile layers.
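The caching rule can be sketched as a toy model in Python. It assumes the simplified behaviour described above: a layer is reusable only if it and every layer before it are unchanged. The layer names and sizes are invented for the example.

```python
def layers_to_rebuild(cached, new):
    """Return the layers of `new` that cannot be reused from `cached`.

    A layer is reusable only if it and every layer before it are unchanged,
    so everything from the first difference onwards must be rebuilt.
    Layers are (name, size_mb) tuples; names and sizes are illustrative.
    """
    for index, layer in enumerate(new):
        if index >= len(cached) or cached[index] != layer:
            return new[index:]
    return []


def rebuild_cost(cached, new):
    """Total size in MB of the layers that must be rebuilt or downloaded."""
    return sum(size for _, size in layers_to_rebuild(cached, new))


# The application layer changes from v1 to v2; the runtime layer does not.
volatile_first_old = [("base", 80), ("app-v1", 5), ("runtime", 300)]
volatile_first_new = [("base", 80), ("app-v2", 5), ("runtime", 300)]

stable_first_old = [("base", 80), ("runtime", 300), ("app-v1", 5)]
stable_first_new = [("base", 80), ("runtime", 300), ("app-v2", 5)]

cost_volatile_first = rebuild_cost(volatile_first_old, volatile_first_new)
cost_stable_first = rebuild_cost(stable_first_old, stable_first_new)
```

With the volatile layer placed early, the same small change costs 305 MB in this model; with the large, stable layer first, it costs 5 MB.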
You should use explicit versioning, always.
Dockerfiles are code, just like any other language. You put them in source control and use a compiler (Docker) to build them into an executable (the container). You want this process to be predictable and repeatable. You don’t want it to break suddenly when a dependency is updated. Imagine your container is in production and has been running happily for more than a year. A small change is requested and you agree to it. You take the Dockerfile out of source control, only to find out that it fails to build. Now you are left with an investigation that prevents you from meeting your deadline.
What are the pieces you need to version?
- The docker image in the FROM header. This is quite obvious.
- The software package you encapsulate in the container, still a no-brainer.
- The libraries and packages used by the software.
- And finally, the tools you install with apt, yum etc. to prepare the docker image.
This last line is often forgotten. If you use a packaging system from a distribution, these packages also change. The behavior or interface may change slightly. The newer versions might be incompatible with the old distribution you are using through your FROM image. Make sure you version the tools, and that the tools remain available for download, for example by copying them to an artifact repository under your control.
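One way to keep an eye on the first item of this list is a small check on the FROM lines. The sketch below is a simplified, hypothetical helper written for this article, not part of any Docker tooling. It treats an image as pinned when it has a tag other than `latest` or a digest, and it would also flag special cases such as `scratch` or a reference to an earlier build stage.

```python
def unpinned_base_images(dockerfile_text):
    """Return base images in FROM lines that lack an explicit version.

    Simplified check: an image counts as pinned when it has a tag other
    than `latest`, or a digest. Flags such as --platform are skipped.
    """
    unpinned = []
    for line in dockerfile_text.splitlines():
        words = line.split()
        if not words or words[0].upper() != "FROM":
            continue
        # Drop the FROM keyword and any leading --flag=value options.
        args = [w for w in words[1:] if not w.startswith("--")]
        if not args:
            continue
        image = args[0]
        if "@" in image:
            continue  # pinned by digest
        # The tag, if any, follows a colon in the part after the last slash,
        # so a registry port such as localhost:5000/app is not a tag.
        last_part = image.rsplit("/", 1)[-1]
        tag = last_part.split(":", 1)[1] if ":" in last_part else None
        if tag is None or tag == "latest":
            unpinned.append(image)
    return unpinned


example = """\
FROM ubuntu
FROM python:3.11-slim AS build
FROM localhost:5000/base:latest
"""
flagged = unpinned_base_images(example)
```

A check like this could run in the build pipeline before the image is built; versions of packages installed with apt or yum would need a separate check.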
There are many things to take into account when we create a container image. This makes the creation of a good image an art in its own right. Good design might not be apparent at first. If the program inside the container runs correctly, who will complain? Only when an image is used extensively will the flaws become visible. By following the steps in this article, you can steer clear of many of the pitfalls.
Component testing is an important protection against regression errors. After every change to your component, you should test its public interfaces in isolation from the environment it runs in. In classic DTAP (development, test, acceptance, production) setups, this can be a pain, but using Docker, you can avoid many of the problems by creating a dedicated environment, just for the occasion.
Our test strategy consists of just 5 simple steps for component testing, that are fully automated using a Jenkins build server:
1) Perform unit tests after compiling your code
During development, it’s important to get quick feedback on errors in your code. The IDE you use is the first layer, and the most direct protection. It protects against syntax errors. The second layer is the unit test. It should verify that your code has no erroneous constructions, like breaking on empty lists and other out-of-bounds exceptions. It should focus on the technical details of your implementation. It should test the constructions in your code, but take care that you are not testing the libraries you are using. Libraries have their own test suites, and duplicating these tests will not add any value.
If you are using a code-quality gateway, such as SonarQube, it should be invoked just after the unit tests. A gateway will improve the quality of the entire project by enforcing code standards and unit-test coverage and by preventing architectural debt in your code. It reduces the burden of peer reviews by automating the bulk of the review work, leaving only the interesting work to the developers.
2) Create a docker image of your component, as usual
Once you have passed your unit tests, you should create a docker image of your component, ready for deployment in the environment. This is a candidate image for production, and it will not be changed anymore. Whenever it passes a test-phase, it will be promoted to a next environment. This means that our docker image needs to be configurable for different environments, but the executable inside together with the internal structure of the docker image must be final.
3) Create a docker image from the image of step 2, and add mocks and settings
The image created in step 2 is final, but Docker allows us to derive from an image and add extra components. We run an application server in our Docker image, where the component is deployed. In the same application server, we can deploy our mock services. All external APIs used by our component are mocked using the same platform as our component itself. This is important, because when using Docker, you should have only one executable running per container. This executable should perform the role of both the component and the mocked services. It also simplifies things, because the developer needs only one skill set instead of two: the application and the mock framework.
The configuration for our component is also added to this image, so that it connects to the mocked APIs out of the box instead of the external APIs. The Docker image needs no further configuration, and is ready to respond to our test messages directly after spinning up.
4) Deploy the image of step 3, and run your component tests against your mocked component
Our component uses one dependency that is hard to mock: the database. This can be circumvented by creating a dedicated database per test. Again, Docker shows its strengths, as we can just spin up a Docker database image together with our component test image. This implies that our component must be able to create its own database structure, or that we have a database image with the predefined structure available. We use the former.
Now that our component is running together with its mocks and database dependencies, we can initiate the component test suite from Jenkins. All tests are run in isolation, on the newly created stand-alone environment, and results are gathered.
The things we verify in the component test phase are functional, and can be written down using the following format: given that the mocks provide certain data, when I call the provided public API of my component, then I expect a certain result. For example: given that a customer X is returned from the customer mock service, when I call the order service to create an order for customer X, the result should be that an order is created.
A good practice is to write down small scenarios of business events and bundle each scenario in a test case. Test cases should be independent of each other, so you shouldn’t use database data stored in one scenario to execute the next one. The only dependencies are between steps inside a scenario, where you can create something, read it, update it, etc. This way you can choose the order of the scenarios and perhaps limit your testing to one case when you try to reproduce an issue.
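The given/when/then format and the independence of test cases can be sketched with Python's `unittest`. This is an in-process illustration only: `CustomerMock` and `OrderService` are hypothetical stand-ins, whereas the setup described here runs the tests against a deployed container through its public API.

```python
import unittest


class CustomerMock:
    """Mocked customer service returning canned data (hypothetical)."""

    def __init__(self, customers):
        self.customers = customers

    def find(self, customer_id):
        return self.customers.get(customer_id)


class OrderService:
    """Tiny stand-in for the component under test (hypothetical)."""

    def __init__(self, customer_service):
        self.customer_service = customer_service
        self.orders = []

    def create_order(self, customer_id, product):
        if self.customer_service.find(customer_id) is None:
            raise ValueError("unknown customer")
        order = {"customer": customer_id, "product": product}
        self.orders.append(order)
        return order


class CreateOrderScenario(unittest.TestCase):
    def setUp(self):
        # Given: the customer mock returns customer X. A fresh service is
        # built for every test, so scenarios share no data.
        self.service = OrderService(CustomerMock({"X": {"name": "Customer X"}}))

    def test_order_is_created_for_known_customer(self):
        # When: the public API is called for customer X.
        order = self.service.create_order("X", "book")
        # Then: an order exists for that customer.
        self.assertEqual(order["customer"], "X")
        self.assertEqual(len(self.service.orders), 1)

    def test_order_is_rejected_for_unknown_customer(self):
        with self.assertRaises(ValueError):
            self.service.create_order("Y", "book")
```

Because `setUp` rebuilds the fixtures for every test, the two scenarios can run in any order, or one at a time when reproducing an issue.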
5) Proceed to deploy the image of step 2, and perform integration and system tests as usual
Once the component testing is successful, the component test environment is deleted, since it isn’t needed anymore, and it should be newly created before every test.
We take the base image we created in step 2 and deploy it on our integration test environment.
Some points to take away
- Our component is able to create its own database structure from scratch, so we can start with an empty database every time.
- We use an application server to host both the component and the mock services
- We build a mocked docker image on top of our production-ready image
- Jenkins is used to create and destroy the docker environments
- Docker compose can be used to create an environment, but specialized products such as Kubernetes or OpenShift make life for a developer much easier.
- Component testing can seem expensive, but the longer the software lives, the more value is returned from component tests. Don’t skip out on the tests, but make implementing tests easier.
The first step in BPM processes is often ad hoc: A customer fills in a request form, an employee reports an issue or a salesperson enters a new sale. You can’t predict when exactly they will occur, but as soon as the task is executed, the business starts working on fulfilling the job. These ad hoc process tasks are an integral part of the business process, or are they not?
Business and IT often tend to disagree at this point. Let’s analyse what makes this situation such a problem. What does the business expect from a BPM process, and what does IT expect?
The business perspective
From a business point of view, every step in the business process must be clear. Business analysts need to write down what is entered in this task, and verify that it contains the data required for the rest of the process. The screen-flow might be defined here, the actor is written down and a RACI matrix can be defined. Every task has clear in- and outputs. Certain conditions may apply, and tools might be needed to fulfill the task. This is the domain of the business analysts, and provides handholds for auditors and managers.
As such this first task is important to include in the BPM flow.
The IT perspective
From an IT perspective, however, this is a very awkward situation. IT gets involved to automate the workflow, but where does the workflow start? The business process magically begins just before the user starts her work. How does the system know it needs to send a task to the user? If the user is not part of the organisation, can we even send a task to the user? What if the user decides halfway through not to fill in the request; do we still have a process then? What if she continues later on?
The system can’t predict that there will be a process, so it has to wait until the user finalizes the task. Once the task is committed, the process starts and the system takes control. This implies that the first task is not part of the workflow. The best way for IT to build this, is to have a data-entry screen in some user interface application. The user interface is free format, and starts the remainder of the workflow by sending a start message.
We could compromise by defining a manual task directly after the process start. This task will not be executed by the workflow engine, but can be used as a placeholder for the information that belongs to the ad hoc process task at the start of the process. The user will enter data using some external front-end application which acts as the manual user task.
This leaves us with two separate types of tasks
- Ad hoc tasks, which are modeled as manual tasks in the BPMN language and aren’t converted to the workflow. They are implemented in the UI layer, and send a message to the workflow engine, which needs a message event to receive the information.
- Assigned user tasks, which are modeled as plain user tasks in BPMN. These have a user, role or group that needs to complete the task, and are the bread and butter of the workflow.
Since we model both tasks, the business can have a complete view of the process. The fact that we don’t execute manual tasks means that we can automate the workflow. If we use the same user interface application to wrap both the ad hoc tasks and the workflow tasks, the integration can be done seamlessly and the end-users will never know the difference.
You know about good BPM design, about separation of long-lived flows and short-lived interactions. You want to move your forms outside of your long-lived flows, preferably in a dedicated GUI product, but the business requires you to put the forms in the BPM flow. Either because they want one system to handle all interaction, they need the persistence in every step of the user-interaction, or maybe because they need the transaction log for legal reasons. How do you solve this dilemma?
Where to put user forms in BPM
User forms are a good example of a short-lived interaction. The interaction can be persisted, but once the user has completed her task, the flow finishes. This type of interaction flow typically changes a lot during its lifetime. The layout of the forms may change, extra fields may get added and even the order of the forms in one interaction may change. This fast-changing piece of code doesn’t belong with the stable code for a business process: the core process almost never changes.
Fast changing code doesn’t belong together with a stable business process
The solution is quite simple. We build at least two layers of process flows: the top layer consists of one long-lived flow that defines the basic steps of the business process. The layer below that consists of a short-lived flow per user task. These processes are packaged separately so that the version control on these packages can be done independently.
The business process defines the high-level steps. When it needs a user interaction, it calls another process and delegates the interaction to this short-lived one. All the forms are in the short-lived process. The long-lived process starts the short-lived one, and waits for its completion. Once the sub-process has completed, the main process continues.
- all interactions are inside the BPM engine: everything is logged and handled in a consistent way as required by the business.
- all forms are inside short-lived processes. If a form is changed, the old versions are gone within a couple of hours or maybe days. No longer do you need to support dozens of different form versions because the processes are still running.
- versioning and deployment of forms can be done independent of the business processes
There is one big pitfall here: if you move the forms into a sub-process, you will feel the urge to define an interface between the main and sub-process that passes the data from the main process to the sub-process and back. Don’t do this!
If you define an interface that passes along the actual data, you are binding your main process to the model of the data used in the forms. You have now moved your dependency, instead of removing it.
Instead, you need to store the actual data somewhere else, for example inside a database. You just pass a reference to the data when you start or end the sub-process. All bindings to the data-model are now removed from the main process. The main process can use the data when required, but does not need to know the shape of the data.
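A minimal sketch of passing a reference instead of the data, in Python. The dictionary `form_data_store` stands in for a database table, and the function names are hypothetical; a real BPM engine would start the sub-process through its own API.

```python
import uuid

# Stand-in for a database table holding form data (hypothetical).
form_data_store = {}


def run_form_subprocess(data_ref):
    """Short-lived process: owns the forms and knows the data model."""
    record = form_data_store[data_ref]
    record["address"] = "1 Example Street"  # filled in by the user's form
    return data_ref  # hand back the reference, never the data itself


def run_main_process():
    """Long-lived process: only stores and passes an opaque reference."""
    data_ref = str(uuid.uuid4())
    form_data_store[data_ref] = {}
    returned_ref = run_form_subprocess(data_ref)
    return returned_ref


ref = run_main_process()
stored = form_data_store[ref]
```

Adding a field to the form changes only `run_form_subprocess` and the stored record; the main process and its interface stay the same.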
Don’t pass data from the main process to the subprocess and back, but only pass references.
Keep your data out of your BPM process, but store references instead. Put your data in a database, that is what it is for!
If you keep fine grained, often changing elements such as forms and data outside of your BPM flow, you end up with a more maintainable design. The BPM process itself just defines what steps to take, but should not define how to do them. Instead it delegates to subprocesses and services that perform the task at hand.
At the start of any large business automation project (after the analysis of the business domain, but before the design starts) you should always ask yourself the question: will I be using case management or classic BPM? This decision has a huge impact on the final result of your BPM project. But what is the difference between case management and classic BPM? And what will be the impact of choosing either solution?
Business Process Management originally was a method of analyzing and optimizing the processes in a business, as opposed to focusing on the individual tasks. By looking at the tasks as a part of the whole, synergy between tasks can be found, and improvements can be made. When this principle was elevated to the IT-era, it became a method that defines all steps in a business process, and made those steps executable.
At the core of this BPM philosophy is the thought that all cases are the same (or one of only a few possible options). Every process therefore goes through exactly the same steps. The content may vary, but the process stays the same. Classic BPM defines the order in which the steps occur. It is a great tool that ensures all necessary tasks have been finished. At the same time, it is excellent at keeping a record of what has been done, by whom and when.
Simple BPMN flow
The major advantages of using this type of BPM are predictability, simplicity and visibility. A process is at a certain step, we know who is responsible for that step, and we can find important statistics such as the workload of the employees. All possible actions at a certain point in the process are known to us, because everything is defined upfront in the process flow.
This upfront definition of all possible orders of steps is also the problem with classic BPM. In simple, straight-forward flows, we can do this easily. With longer, complex flows it quickly becomes near impossible. Cases may involve completely different steps, and share very little. Imagine the business process of requesting an official permit for building a house or factory. This type of permit needs to be investigated by many different departments: does it comply with the local plans, are there prior applications that were rejected, does it pose environmental risks, etc.
If you’d put all of these steps in sequence, it would take too long to evaluate the plan. But running all possible steps in parallel can become a big tangle of dependencies. What if the output of one department influences the flow for the other department? Do we need to model all these types of relations in our big business process?
Case Management Example in BPMN
Instead of focusing on the process, we could also focus on the subject. This is what case management does. Case management consists of many short processes that all operate on a central piece of business information: the case. Each process changes the state of the business information, and each state change might trigger new processes. For example, a high-level environment scan might show that an expert needs to investigate the risks to the groundwater. The high-level scan changes the state of the case, and a new business process is started for the environmental expert to make a report. There is no hard link between these two processes, so they don’t need to know about each other. This makes maintenance a lot easier.
Synergy with complex event processing
In the classic BPM example, the process is predefined and, once started, runs its course with minimal outside interaction. Though it is possible to receive signals from outside, this is not often used in BPM. The idea was to define all steps, including the steps that lead up to the signal, so why would you need a signal from outside? It means that you have missed a part that should have been in the model.
In case management, we have an external source: the case. This business object needs to be monitored, and every state change might trigger a new process. Monitoring an object for changes is exactly what complex event processing is good at. We can monitor all changes on the object. When certain combinations of signals are found, we start a new BPM process to handle the situation. The business process itself doesn’t need the complex logic of monitoring the state, as it would have in the classic BPM example. The complexity is shifted towards a rule engine, which is easier to work with and is dedicated to this kind of complexity.
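The interplay between the case and the rules can be sketched in a few lines of Python. This is a toy stand-in for a rule engine, not the API of any BPM or event-processing product; the state fields and process names are made up for the building-permit example.

```python
class Case:
    """Central business object whose state changes trigger processes."""

    def __init__(self, rules):
        self.state = {}
        self.rules = rules        # list of (condition, process_name) pairs
        self.started = []         # processes started so far

    def update(self, **changes):
        """Apply a state change, then let the rules react to the new state."""
        self.state.update(changes)
        for condition, process_name in self.rules:
            if process_name not in self.started and condition(self.state):
                self.started.append(process_name)


# Rule names and state fields are made up for the building-permit example.
permit_rules = [
    (lambda s: s.get("groundwater_risk") is True, "expert-groundwater-report"),
    (lambda s: s.get("prior_rejection") is True, "review-prior-application"),
]

case = Case(permit_rules)
case.update(scan_done=True, groundwater_risk=True)  # result of the scan
case.update(zoning_ok=True)                         # unrelated state change
```

The scan process only writes its result to the case. The rule, not the scan, decides that the expert report process should start, so neither process needs to know about the other.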
Reduced complexity with short-lived processes
The move from one big BPMN flow to many short-lived BPM processes with a shared case object makes the design a lot easier. The complex decision logic is removed from the BPM flow, making the design simpler and more maintainable. The short-lived BPM processes are also to be preferred over long-lived processes, because the governance of long-lived processes tends to be a lot harder. Imagine finding a bug in a process. When the process lives for 3 hours max, we know that after installing the fix, the bug will have left the system on its own after 3 hours, and we don’t have to keep track of it anymore.
Now if the process remains active for 3 months, we would need to keep track of it for all that time. For every change on the system during those 3 months, we need to think of this particular bug and its impact. For every incident report, we need to cross-check it with this bug. This might sound trivial, but in a large, fast-changing environment there might be a hundred small bugs that all demand a bit of attention. Added up, the support organisation might get bogged down by it. Reduction of the complexity is so very important to keeping the support organisation flexible.
When to use case management
Case Management isn’t a magical solution to all BPM problems. It fits a certain area of the BPM domain. When you are facing a straightforward workflow, case management might not be a good solution. The simplicity of classic BPM can make communicating about it much easier. The BPMN flow is easier to comprehend, and holds all the possible interactions.
Only when the flow becomes more complicated, with parallel tasks and many conditions, does case management begin to shine. Taking the complexity out of the BPMN design and into a stateful business object allows us to manage the problems. Moving the complexity into a rule engine reduces complexity in the flows even more. The most important thing to take away here is that you need to avoid complexity in long-living processes, since they are hard to govern. Case Management is a good tool to accomplish this.
In a previous post, I showcased an example of waste in a customer environment. The customer had two massive servers in a fail-over setup, which resulted in low utilization. The customer was paying for a system that was at best 50% utilized, so 50% of his costs were for the fail-over scenario that may happen once or twice during the lifetime of the servers. To reduce the cost, we can switch to a scalable environment. But how do you achieve a scalable environment? How do you implement cloud scaling?
How to implement cloud scaling
The general approach in cloud design is to apply the pets versus cattle idiom:
Pets versus Cattle
A pet is unique. It is something you care for, put in lots of effort and if something happens to it, it can’t be replaced without loss.
Cattle is uniform. If it gets sick or it misbehaves, you take it away and get new cattle. The replacement costs are low.
How does this apply to cloud computing and scaling? We must make sure our servers are not like pets: high value, high risk, hard to replace. We can do this by taking the following steps.
Before we can make a scalable cluster of servers, we first need to make sure the server needs as little attention as possible. All steps for startup and shutdown should be automated. This allows an automated system to manage the server without the need of a human to intervene. The system can adjust to the actual load during the night, in weekends and during the holidays at no extra costs.
Automating startup is not enough. If it takes two hours to start an extra server, the need may already have passed. Even worse, the load may have been too high for the other servers to handle, causing problems and unhappy customers. The startup process should be as light as possible.
In order to do this, the server should be pre-installed as much as possible. All packages should be there in the correct version, and no installation steps should be required.
Pre-install all packages
To get to this point, we need a server image that isn’t a raw operating system like we normally use to start a fresh server. The server image we need is one that has been prepared with the correct software pre-installed. The complete runtime platform must be configured on the image so that no time is wasted installing modules. Once we have all platform software installed, we make our own image snapshot, and we’ll use this master image for all our server instances. Only one master image exists, and new servers are created by cloning the master image and starting the copy. The master image itself is never started.
Second, we will need the platform to load the actual packages. Where the platform software is usually stable and proven, the actual business logic usually resides in custom-built packages that are loaded into the platform. These packages change quickly with business needs. To merge them with the server image would mean we either have to install the latest version after the server comes up, or we have to rebuild the server image whenever a new package is released. Both are unwanted.
Separate the platform and the custom software
One solution is to separate the packages from the platform. Packages need to be stored at a central location, configured and all, ready to be loaded by the server. Once the server starts, it loads the package and configuration from the central location without the need for an extra installation step. When the server shuts down, the configuration persists, while the server image is discarded.
Creating packages that can be stored and accessed like this is not trivial. Not every platform supports this behavior. One possible solution is the combination of micro-services and Docker. We will go into these subjects in a future post.
Cloud scaling is not simply the process of starting more servers on demand. The server images need to be prepared in order to start automatically, efficiently and reliably. Once this is done, the cloud software is able to manage the server instances and to keep the utilization at the requested levels.
One important thing to note is that the above steps for preparing a scalable server image also hold for preparing any other scalable service. From docker images to database clusters to cloud scaling, every scalable environment shares these same principles.
People often wonder what makes a good BPM process design. Should a process include many small steps, or should it be a few large steps? Do we avoid user actions, or do we make extensive page-flows as part of our process? I find that these things are not what makes a process design good or bad. Good design is mostly about avoiding pitfalls. Here are some tips that will make your BPM processes better.
Separation of business data and process state
The most important step for good BPM process design is to remove all your business data from the process. You don’t need to keep the address of your customer in your BPM database. It only introduces problems, because what happens if your customer moves while your BPM flow is in the middle of the process? Instead you only want references to entities in your BPM. Everything else must be moved to a service elsewhere in your enterprise environment.
Keep the focus on the process: which steps do I need to take, and which decisions do I need to make? The business data is there to facilitate the flow, but instead of pulling the data inside and holding it inside the process, you need to access it on demand and only store the answer to your question, not the data that leads up to the answer. You may want to log the decision making, but you should not keep the data itself for later use.
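As a minimal sketch of this separation: the process below stores only a customer reference and the answers it obtained, and asks for business data when it needs it. The names `CUSTOMER_SERVICE`, `customer_country` and `IntakeProcess` are hypothetical and stand in for whatever service and BPM engine you actually use.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a customer service elsewhere in the enterprise.
CUSTOMER_SERVICE = {"c-42": {"country": "NL", "address": "Old Street 1"}}


def customer_country(customer_id: str) -> str:
    """Fetch the current country on demand from the (hypothetical) service."""
    return CUSTOMER_SERVICE[customer_id]["country"]


@dataclass
class IntakeProcess:
    customer_id: str  # a reference only, no business data
    decisions: dict = field(default_factory=dict)

    def decide_domestic_shipping(self) -> bool:
        # Ask the question when the answer is needed and keep only the answer.
        answer = customer_country(self.customer_id) == "NL"
        self.decisions["domestic_shipping"] = answer
        return answer
```

If the customer moves while the process is waiting, the next question to the service simply returns the new situation, because the process never held a copy of the address.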
Not so long living processes
We often state that BPM is for long-living processes, in contrast to the normal, short-living EAI and SOA services. This is however a major pitfall. Long-living in this context means that the process can be suspended for a while, that it is able to survive a system reboot and that it may wait for user input for a while. It doesn’t mean that the process should live for months or years. It would be a very bad design if a BPM process starts when somebody becomes a customer, and ends when she leaves. The process would potentially live for many years, pulling along the burden of many years of patches, changes and upgrades. In the end, every process would be unique: the combination of upgrades and the state at which each upgrade happens will make it so that no two processes share the same history. A system of many unique processes is unmaintainable, and impossible to govern.
Instead we should have multiple short-lived business processes that each focus on a certain state change: somebody becomes a customer. This state change is always about a business entity, and this entity will be the part that continues on after the process ends. The BPM process will be long-living, but in this context, that may be 1 hour for the administrative steps to finish, or a few days while the credit check is pending, etc.
This process will now live for a few days at most, making maintenance and governance much easier.
Move forward, not backwards
An efficient business process never returns to a previous step. Your customer intake should include all information that is required for your BPM process, so that you don’t need to call the customer again for more information. The same holds for the other steps in your flow. Once a user has finished her input, the flow should move on and not return to her.
It should also hold for business services: you go to a service and complete your interaction in one, atomic step. This avoids a lot of pain, because otherwise it could happen that the business data is modified in between your steps, making your second step invalid. Handling these types of errors is one of the most costly parts of BPM maintenance.
In process design, this means that the decision making may need to occur during the interaction with said user. An extra screen with detailed questions may pop up if a certain condition applies later in the business flow. By pulling this decision to an earlier moment in the flow, we can avoid going back to the user for follow-up questions. This avoids stalls in the process, such as the original user being on holiday or otherwise unable to pick up this specific task.
Finally it also holds that a loop should not occur in the normal flow. You can have a loop that handles every item in an array, but you should avoid a loop that performs the same steps for the same business entity again and again. If such business logic is required, it should be multiple business process instances started after each other on the same business entity, not one big process.
There is however one exception: during error handling, a retry is very common, and in this case it is perfectly acceptable to loop back to a failed step to retry it.
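The retry exception can be sketched as follows. This is only an illustration: `run_step_with_retry` is a hypothetical helper, the limit of three attempts is an assumption, and a real BPM engine will usually offer its own retry mechanism.

```python
def run_step_with_retry(step, max_attempts=3):
    """Run one process step, looping back to it only when it fails."""
    last_error = None
    for _ in range(max_attempts):
        try:
            # On success the flow moves forward and never returns here.
            return step()
        except Exception as error:  # a real flow would catch narrower errors
            last_error = error
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error
```

The loop exists only on the error path: a step that succeeds on its first attempt is executed exactly once.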
Keep it simple stupid
Designers tend to overthink the BPM process. Many of the details can and should be handled outside of the stateful BPM process. Decision making for example can be handled in the BPM code, but it is harder to maintain than a simple service that provides a yes/no answer. The code in the BPM flow should then only test the answer of the service, and not do the underlying comparison itself. The business logic of these business rules tends to change often, while the BPM flow normally stays the same. A stateless service can be modified easily without issues with backward compatibility and job migrations, while a stateful BPM flow has all of this pain embedded. Keep the hard stuff outside of the BPM code, and only pull the results in the BPM scope.
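A small sketch of what this looks like in practice, with made-up names and numbers: `credit_check_passes` plays the role of the stateless service that owns the business rule, and `next_step` is the only thing the flow itself contains.

```python
def credit_check_passes(customer_id: str) -> bool:
    """Hypothetical stateless rule service: the rule lives here and may change often."""
    # Illustrative data and limit; a real service would query its own sources.
    open_debt = {"c-1": 0, "c-2": 12_000}
    return open_debt.get(customer_id, 0) < 5_000


def next_step(customer_id: str) -> str:
    """Process flow: test the answer of the service, not the underlying data."""
    if credit_check_passes(customer_id):
        return "activate_account"
    return "manual_review"
```

Changing the debt limit or the rule itself only touches the service. The flow keeps routing on a yes/no answer, so running process instances do not need to be migrated.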
Good BPM process design is not easy. It is mainly about avoiding the common pitfalls. The list here is by no means complete, but these pitfalls are often overlooked in the standard training offered by BPM software firms. That training is about what is possible, but it tends to forget that doing so may be unwise. Don’t blindly follow what the tools offer: to them BPM is a hammer and every problem a nail. You should know when to use the hammer, and use it well.
Recently I visited a customer that was moving its applications to new servers. They had acquired two huge servers to host their middleware solution, connecting their on-premise applications and their cloud-hosted solutions. The client went for an active-active setup, where both servers handle the load, but if one server fails, the remaining server must be able to handle all the traffic. It sounds like a good setup, but it introduces a lot of waste. We will go into this and show why a cloud solution using auto-scaling is so much better.
The machines are scaled to handle peak performance. This is 100% load, and we must keep in mind that this is a large company with a big IT landscape, so it must handle a lot of data at times. This load is balanced over two instances, but if either server fails, the other must be able to take on the full load, so we can never put more than 50% load on one machine. This means that at best, we can expect to have 50% idle time on these machines. However, since we won’t be running at maximum capacity all day, only at a few peak-moments, the real loss can be expected to be larger, like 60-80% idle time.
To summarize, the client had bought two huge, expensive machines that would be working for 20-40% of the time…
What could they have done better?
Improved server utilization
We keep the original requirement of the client to have an active-active setup, and to have one machine in reserve in case another goes down. However, we change our approach from a few big machines to many small machines.
Let’s do the calculations again, now for 5 smaller machines instead of two big machines. Again we want to be able to handle a single machine failure.
100% load must be handled by our landscape of 5 machines, but if one fails, only 4 machines are left, so each machine must be scaled to handle 25% of the total traffic. During normal operations however, we have 5 machines available, so they will run at 20% of the traffic at peak-moments. This means that at peak moments, the server will be running at 20/25 * 100 = 80% of the maximum load the machine can handle. In other words, it will have 20% idle time, instead of 50% idle time in the previous example.
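The same calculation can be written as a small function, so you can try other group sizes. This is a sketch under the assumptions used above: the load is spread evenly, and each machine is sized so that the group survives the loss of `spare` machines at peak. The function name is made up for this example.

```python
def peak_utilization(machines: int, spare: int = 1) -> float:
    """Fraction of each machine's capacity used at peak while all machines are up.

    Each machine is sized so that (machines - spare) of them can carry 100% of
    the peak traffic; with all machines running, the traffic is spread wider.
    """
    if machines <= spare:
        raise ValueError("need more machines than spares")
    share_when_all_up = 1 / machines               # 20% of traffic for 5 machines
    capacity_per_machine = 1 / (machines - spare)  # sized for 25% for 5 machines
    return share_when_all_up / capacity_per_machine


for n in (2, 5, 10):
    print(f"{n} machines: {peak_utilization(n):.0%} busy at peak")
```

For 2, 5 and 10 machines this gives 50%, 80% and 90% utilization at peak: the more machines share the load, the smaller the part of your capacity that is reserved for a single failure.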
You pay for capacity, so having less idle time means you’ll be paying less for your servers.
Can we improve on this?
The cloud approach to such a scaling problem would be to use an auto-scaling group. Such an auto-scaling group consists of a number of smaller machines, all virtualized. In the above example, we would have created a group with a minimum of two machines and a maximum of 5 machines. This satisfies the requirements of the customer that we must have one hot standby at all times, and we must be able to handle 100% load at peak moments. The auto-scaling group can spin up extra servers when the need arises, or it can remove servers if the load drops below a certain threshold. The rules must be set so that the total load of the n servers will never be more than n-1 servers could handle, because when a server fails, the others must be able to handle the load. As soon as this threshold is met, a new instance will be started in order to lower the load per machine.
Now, whenever the load of the system drops below the peak load, the auto-scaling group can turn off servers to reduce cost. When the load increases again, the system simply starts a new server.
The thresholds for scaling up- and down can be calculated using the following formula:
Let S be the sum of the loads of all the running instances, let T be the maximum load that can be handled by all running instances together, and let N be the number of running instances.
upper-bound: if S > ((N-1) / N) * T then scale up
lower-bound: if S < ((N-2) / N) * T then scale down
We scale up when the load no longer fits on N-1 servers. We scale down only when the load would fit on N-2 servers, so that the N-1 servers that remain still include one standby.
If the load fluctuates around a boundary, the number of servers would keep going up and down. To prevent this from happening, a delay is introduced by the cloud hoster, but we should also lower the lower-bound by 5% or so to have some bandwidth in which the number of servers is stable. We can’t touch the upper bound, since it would violate the one server standby requirement.
Of course down-scaling can only occur for N>2, as we need a minimum of 2 servers. Up-scaling could continue without limit, we only introduce a soft limit of 5 servers to limit the maximum costs, but if business needs change, we could up this limit.
From this follows that, apart from the small margin, the load of N machines (when we have more than 2) will never be below (N-2) / N of their capacity, for example the load of 3 machines will never be below 1/3 (33%), the load of 4 machines will never be below 1/2 (50%) and the load of 5 machines will never be below 3/5 (60%). The higher the load, the better the utilization becomes, and when the load drops, the extra machines are shut down to improve utilization again.
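The scaling rules can be sketched as a single decision function. This is not the rule syntax of any cloud provider, and the function and parameter names are made up for this example. The sketch works in absolute load: with a capacity of c per instance, it scales up when the load exceeds what N-1 instances can carry, and it scales down only when N-2 instances could carry the load, so that the smaller group still has one standby. The 5% margin on the lower bound and the limits of 2 and 5 instances are the values from the text.

```python
def scaling_decision(total_load: float, instances: int, capacity_per_instance: float,
                     min_instances: int = 2, max_instances: int = 5,
                     margin: float = 0.05) -> str:
    """Return "up", "down" or "hold" for an auto-scaling group with one standby.

    total_load and capacity_per_instance use the same unit (e.g. requests/s).
    """
    # Scale up when the load no longer fits on one instance fewer.
    upper = (instances - 1) * capacity_per_instance
    # Scale down only when the smaller group would still have a standby,
    # lowered by a margin so the group does not flap around the boundary.
    lower = (instances - 2) * capacity_per_instance * (1 - margin)

    if total_load > upper and instances < max_instances:
        return "up"
    if total_load < lower and instances > min_instances:
        return "down"
    return "hold"
```

For example, with instances that each handle 100 requests per second, three running instances stay up for any load between 95 and 200 requests per second. Below 95 one instance is removed, and above 200 a fourth one is started.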
The result will be that depending on the fluctuations in your landscape, your average utilization percentage will increase even more, resulting in a lower running cost of your landscape.
How to achieve this auto-scaling behavior will be addressed in an upcoming post.