At the start of any large business automation project (after the analysis of the business domain, but before the design starts) you should always ask yourself the question: will I be using case management or classic BPM? This decision has a huge impact on the final result of your BPM project. But what is the difference between case management and classic BPM? And what will be the impact of choosing either solution?
Business Process Management originally was a method of analyzing and optimizing the processes in a business, as opposed to focusing on the individual tasks. By looking at the tasks as a part of the whole, synergy between tasks can be found, and improvements can be made. When this principle was carried over to the IT era, it became a method that defines all steps in a business process and makes those steps executable.
At the core of this BPM philosophy is the thought that all cases are the same (or one of only a few possible options). Every process therefore goes through exactly the same steps. The content may vary, but the process stays the same. Classic BPM defines the order in which the steps occur. It is a great tool that ensures all necessary tasks have been finished. At the same time, it is excellent at keeping a record of what has been done, by whom and when.
Simple BPMN flow
The major advantages of using this type of BPM are predictability, simplicity and visibility. A process is at a certain step, we know who is responsible for that step, and we can find important statistics such as the workload of the employees. All possible actions at a certain point in the process are known to us, because everything is defined upfront in the process flow.
This upfront definition of all possible orders of steps is also the problem with classic BPM. In simple, straight-forward flows, we can do this easily. With longer, complex flows it quickly becomes near impossible. Cases may involve completely different steps, and share very little. Imagine the business process of requesting an official permit for building a house or factory. This type of permit needs to be investigated by many different departments: does it comply with the local plans, are there prior applications that were rejected, does it pose environmental risks, etc.
If you’d put all of these steps in sequence, it would take too long to evaluate the plan. But running all possible steps in parallel can become a big tangle of dependencies. What if the output of one department influences the flow for the other department? Do we need to model all these types of relations in our big business process?
Case Management Example in BPMN
Instead of focusing on the process, we could also focus on the subject. This is what case management does. Case management consists of many short processes that all operate on a central piece of business information: the case. Each process changes the state of the business information, and each state change might trigger new processes. For example, a high-level environment scan might show that an expert needs to investigate the risks to the groundwater. The high-level scan changes the state of the case, and a new business process is started for the environmental expert to make a report. There is no hard link between these two processes, so they don’t need to know about each other. This makes maintenance a lot easier.
Synergy with complex event processing
In the classic BPM example, the process is predefined and, once started, runs its course with minimal outside interaction. Though it is possible to receive signals from outside, this is not often used in BPM. The idea was to define all steps, including the steps that lead up to the signal, so why would you need a signal from outside? It means that you have missed a part that should have been in the model.
In case management, we have an external source: the case. This business object needs to be monitored, and every state change might trigger a new process. Monitoring an object for changes is exactly what complex event processing is good at. We can monitor all changes on the object. When certain combinations of signals are found, we start a new BPM process to handle the situation. The business process itself doesn’t need the complex logic of monitoring the state, as it would have in the classic BPM example. The complexity is shifted towards a rule engine, which is easier to work with and dedicated to this kind of complexity.
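To make the idea concrete, here is a minimal Python sketch of a case object that is watched by a tiny rule engine. It is not taken from any BPM product: the names `Case`, `Rule`, `CaseMonitor` and the state keys are assumptions chosen for illustration, and a real rule engine would do far more.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Case:
    """The central business object (hypothetical structure)."""
    case_id: str
    state: Dict[str, str] = field(default_factory=dict)


@dataclass
class Rule:
    """Starts a short process when the case state matches a condition."""
    name: str
    condition: Callable[[Dict[str, str]], bool]
    process: Callable[[Case], None]


class CaseMonitor:
    """Minimal stand-in for a rule engine that watches state changes."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules
        self.started: List[str] = []  # record of processes already started

    def update(self, case: Case, key: str, value: str) -> None:
        case.state[key] = value
        for rule in self.rules:
            marker = f"{case.case_id}:{rule.name}"
            # Start each process at most once per case.
            if marker not in self.started and rule.condition(case.state):
                self.started.append(marker)
                rule.process(case)


def groundwater_report(case: Case) -> None:
    # A short-lived process; here it only records that a report was requested.
    case.state["groundwater_report"] = "requested"


rules = [
    Rule(
        name="groundwater_report",
        condition=lambda s: s.get("environment_scan") == "groundwater_risk",
        process=groundwater_report,
    ),
]

monitor = CaseMonitor(rules)
permit = Case("permit-001")
# The high-level scan changes the case state, which triggers the expert report.
monitor.update(permit, "environment_scan", "groundwater_risk")
```

Note that the scan process and the report process never reference each other. The only thing they share is the case, which is what keeps them independent.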
Reduced complexity with short-lived processes
The move from one big BPMN flow to many short-lived BPM processes with a shared case object makes the design a lot easier. The complex decision logic is removed from the BPM flow, making the design simpler and more maintainable. The short-lived BPM processes are also to be preferred over long-lived processes, because the governance of long-lived processes tends to be a lot harder. Imagine finding a bug in a process. When the process lives for 3 hours max, we know that after installing the fix, the bug will have left the system on its own after 3 hours, and we don’t have to keep track of it anymore.
Now if the process remains active for 3 months, we would need to keep track of it for all that time. For every change on the system during those 3 months, we need to think of this particular bug and its impact. For every incident report, we need to cross-check it with this bug. This might sound trivial, but in a large, fast-changing environment there might be a hundred small bugs that all demand a bit of attention. Added up, the support organisation might get bogged down by it. Reduction of the complexity is so very important to keeping the support organisation flexible.
When to use case management
Case Management isn’t a magical solution to all BPM problems. It fits a certain area of the BPM domain. When you are facing a straightforward workflow, case management might not be a good solution. The simplicity of classic BPM can make communicating about it much easier. The BPMN flow is easier to comprehend, and holds all the possible interactions.
Only when the flow becomes more complicated, with parallel tasks and many conditions, does case management begin to shine. Taking the complexity out of the BPMN design, and into a stateful business object, allows us to manage the problems. Moving the complexity into a rule engine reduces complexity in the flows even more. The most important thing to take away here is that you need to avoid complexity in long-living processes, since they are hard to govern. Case Management is a good tool to accomplish this.
In a previous post, I showcased an example of waste in a customer environment. The customer had two massive servers in a fail-over setup, which resulted in low utilization. The customer was paying for a system that was at best 50% utilized, so 50% of his costs were for the fail-over scenario that may happen once or twice during the lifetime of the servers. To reduce the cost, we can switch to a scalable environment. But how do you achieve a scalable environment? How do you implement cloud scaling?
How to implement cloud scaling
The general approach in cloud design is to apply the pets versus cattle idiom:
Pets versus Cattle
A pet is unique. It is something you care for, put in lots of effort and if something happens to it, it can’t be replaced without loss.
Cattle is uniform. If it gets sick or it misbehaves, you take it away and get new cattle. The replacement costs are low.
How does this apply to cloud computing and scaling? We must make sure our servers are not like pets: high value, high risk, hard to replace. We can do this by taking the following steps.
Before we can make a scalable cluster of servers, we first need to make sure the server needs as little attention as possible. All steps for startup and shutdown should be automated. This allows an automated system to manage the server without the need for a human to intervene. The system can adjust to the actual load during the night, in weekends and during the holidays at no extra cost.
Automating startup is not enough. If it takes two hours to start an extra server, the need may already have passed. Even worse, the load may have been too high for the other servers to handle, causing problems and unhappy customers. The startup process should be as light as possible.
In order to do this, the server should be pre-installed as much as possible. All packages should be there in the correct version, and no installation steps should be required.
Pre-install all packages
To get to this point, we need a server image that isn’t a raw operating system like we normally use to start a fresh server. The server image we need is one that has been prepared with the correct software pre-installed. The complete runtime platform must be configured on the image so that no time is wasted installing modules. Once we have all platform software installed, we make our own image snapshot, and we’ll use this master image for all our server instances. Only one master image exists, and new servers are created by cloning the master image and starting the copy. The master image itself is never started.
Second, we will need the platform to load the actual packages. Where the platform software usually is stable and proven software, the actual business logic usually resides in custom-built packages that are loaded into the platform. These packages change quickly with business needs. To merge them with the server image would mean we either have to install the latest version after the server comes up, or we have to rebuild the server image whenever a new package is released. Both are unwanted.
Separate the platform and the custom software
One solution is to separate the packages from the platform. Packages need to be stored at a central location, configured and all, ready to be loaded by the server. Once the server starts, it loads the package and configuration from the central location without the need for an extra installation step. When the server shuts down, the configuration persists, while the server image is discarded.
Creating packages that can be stored and accessed like this is not trivial. Not every platform supports this behavior. One of the possible solutions is the combination of micro-services and Docker. We will go into these subjects in a future post.
Cloud scaling is not simply the process of starting more servers on demand. The server images need to be prepared in order to start automatically, efficiently and reliably. Once this is done, the cloud software is able to manage the server instances and to keep the utilization at the requested levels.
One important thing to note is that the above steps for preparing a scalable server image also hold for preparing any other scalable service. From Docker images to database clusters to cloud scaling, every scalable environment shares these same principles.
People often wonder what makes a good BPM process design. Should a process include many small steps, or should it be a few large steps? Do we avoid user actions, or do we make extensive page-flows as part of our process? I find that these things are not what makes a process design good or bad. Good design is mostly about avoiding pitfalls. Here are some tips that will make your BPM processes better.
Separation of business data and process state
The most important step for good BPM process design is to remove all your business data. You don’t need to keep the address of your customer in your BPM database. It introduces only problems, because what happens if your customer moves while your BPM flow is in the middle of the process? Instead you only want references to entities in your BPM. Everything else must be moved to a service elsewhere in your enterprise environment.
Keep the focus on the process: which steps do I need to take, and which decisions do I need to make? The business data is there to facilitate the flow, but instead of pulling the data inside and holding it inside the process, you need to access it on-demand and only store the answer of your question, not the data that leads up to the answer. You may want to log the decision making, but you should not keep the data itself for later use.
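As a rough Python sketch of this principle: the process state below holds only a reference to the customer and the answer to one decision. The `CUSTOMER_SERVICE` dictionary, the `OnboardingProcess` class and the "ships abroad" question are all hypothetical stand-ins for illustration, not part of any BPM product.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical stand-in for a customer service elsewhere in the enterprise.
CUSTOMER_SERVICE: Dict[str, Dict[str, str]] = {
    "cust-42": {"name": "A. Customer", "country": "NL"},
}


def lookup_country(customer_id: str) -> str:
    """Fetch the current value on demand instead of copying it into the process."""
    return CUSTOMER_SERVICE[customer_id]["country"]


@dataclass
class OnboardingProcess:
    """Process state: a reference to the entity plus the answers to decisions."""
    customer_id: str                     # reference only, no customer data
    ships_abroad: Optional[bool] = None  # the answer, not the address behind it

    def decide_shipping(self) -> None:
        # Ask the question at the moment the decision is needed.
        self.ships_abroad = lookup_country(self.customer_id) != "NL"


process = OnboardingProcess("cust-42")
# The customer moves while the process is waiting; the process holds no stale copy.
CUSTOMER_SERVICE["cust-42"]["country"] = "BE"
process.decide_shipping()
```

Because the address lives only in the service, the move is picked up automatically when the decision is made.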
Not so long living processes
We often state that BPM is for long-living processes, as opposed to the normal, short-living EAI and SOA services. This is however a major pitfall. Long-living in this context means that the process can be suspended for a while, that it is able to survive a system reboot and that it may wait for user input for a while. It doesn’t mean that the process should live for months or years. It would be a very bad design if your BPM process starts when somebody becomes a customer, and ends when she leaves. The process would potentially live for many years, pulling along the burden of many years of patches, changes and upgrades. In the end, every process would be unique: the combination of upgrades and the state at which each upgrade happens will make it so that no two processes share the same history. A system of many unique processes is unmaintainable, and impossible to govern.
Instead we should have multiple short-lived business processes that each focus on a certain state change: somebody becomes a customer. This state change is always about a business entity, and this entity will be the part that continues on after the process ends. The BPM process will be long-living, but in this context, that may be 1 hour for the administrative steps to finish, or a few days while the credit check is pending, etc.
This process will now live for a few days at most, making maintenance and governance much easier.
Move forward, not backwards
An efficient business process never returns to a previous step. Your customer intake should include all information that is required for your BPM process, so that you don’t need to call the customer again for more information. The same holds for the other steps in your flow. Once a user has finished her input, the flow should move on and not return to her.
It should also hold for business services: you go to a service and complete your interaction in one, atomic step. This avoids a lot of pain, because otherwise it could happen that the business data is modified in between your steps, making your second step invalid. Handling these types of errors is one of the most costly parts of BPM maintenance.
In process design, this means that the decision making may need to occur during the interaction with said user. An extra screen with detailed questions may pop up if a certain condition applies later in the business flow. By pulling this decision to an earlier moment in the flow, we can avoid going back to the user for follow-up questions. This avoids stalls in the process, such as the original user being on holiday or otherwise unable to pick up this specific task.
Finally it also holds that a loop should not occur in the normal flow. You can have a loop that handles every item in an array, but you should avoid a loop that performs the same steps for the same business entity again and again. If such business logic is required, it should be multiple business process instances started after each other on the same business entity, not one big process.
There is however one exception: during error handling, a retry is very common, and in this case it is perfectly acceptable to loop back to a failed step to retry it.
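A retry of this kind could look like the Python sketch below. The helper name `run_step_with_retry` and the default of three attempts are assumptions for illustration; most BPM engines offer their own retry configuration, which you would normally use instead.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_step_with_retry(step: Callable[[], T], max_attempts: int = 3) -> T:
    """Re-run a failed step a bounded number of times, then give up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: hand the error to the error flow or an operator.
                raise
    raise AssertionError("unreachable")
```

The loop is bounded on purpose: an unbounded retry is just another long-living process in disguise.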
Keep it simple stupid
Designers tend to overthink the BPM process. Many of the details can and should be handled outside of the stateful BPM process. Decision making for example can be handled in the BPM code, but it is harder to maintain than a simple service that provides a yes/no answer. The code in the BPM flow should then only test the answer of the service, and not do the underlying comparison itself. The business logic of these business rules tends to change often, while the BPM flow normally stays the same. A stateless service can be modified easily without issues with backward compatibility and job migrations, while a stateful BPM flow has all of this pain embedded. Keep the hard stuff outside of the BPM code, and only pull the results in the BPM scope.
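The split between flow and decision service can be sketched in a few lines of Python. Everything here is hypothetical: `credit_approved` stands in for a stateless rule service, and the credit limit is a made-up business rule.

```python
def credit_approved(customer_id: str, amount: float) -> bool:
    """Hypothetical stateless decision service; the limits are made up."""
    credit_limits = {"cust-42": 5000.0}  # stand-in for the real business rules
    return amount <= credit_limits.get(customer_id, 0.0)


def order_flow(customer_id: str, amount: float) -> str:
    """The flow only branches on the answer; it does no comparison itself."""
    if credit_approved(customer_id, amount):
        return "ship_order"
    return "request_prepayment"
```

When the credit rules change, only `credit_approved` needs a new release. The flow, and every instance of it that is already running, stays untouched.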
Good BPM process design is not easy. It is mainly about avoiding the common pitfalls. The list given here is by no means complete, but these pitfalls are often overlooked in the standard training you get from the BPM software vendors. That training is about what is possible, but it tends to forget that doing what is possible may be unwise. Don’t go blindly on what the tools offer: to them BPM is a hammer and every problem a nail. You should know when to use the hammer, and use it well.
Lately I was visiting a customer that was moving its applications to new servers. They had acquired two huge servers to host their middleware solution, connecting their on-premise applications and their cloud-hosted solutions. The client went for an active-active setup, where both servers are handling the load, but if one server fails, the remaining server must be able to handle all the traffic. It sounds like a good setup, but it introduces a lot of waste. We will go into this and show why a cloud solution using auto-scaling is so much better.
The machines are scaled to handle peak performance. This is 100% load, and we must keep in mind that this is a large company with a big IT landscape, so it must handle a lot of data at times. This load is balanced over two instances, but if either server fails, the other must be able to take on the full load, so we can never put more than 50% load on one machine. This means that at best, we can expect to have 50% idle time on these machines. However, since we won’t be running at maximum capacity all day, only at a few peak-moments, the real loss can be expected to be larger, like 60-80% idle time.
To summarize, the client had bought two huge, expensive machines that would be working for 20-40% of the time…
What could they have done better?
Improved server utilization
We keep the original requirement of the client to have an active-active setup, and to have one machine in reserve in case another goes down. However, we change our approach from a few big machines to many small machines.
Let’s do the calculations again, now for 5 smaller machines instead of two big machines. Again we want to be able to handle a single machine failure.
100% load must be handled by our landscape of 5 machines, but if one fails, only 4 machines are left, so each machine must be scaled to handle 25% of the total traffic. During normal operations however, we have 5 machines available, so they will run at 20% of the traffic at peak-moments. This means that at peak moments, the server will be running at 20/25 * 100 = 80% of the maximum load the machine can handle. In other words, it will have 20% idle time, instead of 50% idle time in the previous example.
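The same calculation can be written as a small Python function, so you can try other machine counts. The function name and the `tolerated_failures` parameter are my own naming for this sketch; it assumes identical machines and a perfectly balanced load.

```python
def peak_utilization(machines: int, tolerated_failures: int = 1) -> float:
    """Fraction of each machine's capacity that is used at peak load.

    Each machine is sized so that the survivors of a failure can still
    carry 100% of the peak load between them.
    """
    survivors = machines - tolerated_failures
    if survivors < 1:
        raise ValueError("need at least one surviving machine")
    return survivors / machines


for n in (2, 5, 10):
    print(n, f"{peak_utilization(n):.0%}")
# prints:
# 2 50%
# 5 80%
# 10 90%
```

The more machines share the load, the smaller the share of capacity that is reserved for the fail-over scenario.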
You pay for capacity, so having less idle time means you’ll be paying less for your servers.
Can we improve on this?
The cloud approach to such a scaling problem would be to have an auto-scaling group. Such an auto-scaling group consists of a number of smaller machines, all virtualized. In the above example, we would have created a group with a maximum of 5 machines and a minimum of 2 machines. This satisfies the requirements of the customer that we must have one hot standby at all times, and we must be able to handle 100% load at peak moments. The auto-scaling group can spin up extra servers when the need arises, or it can remove servers if the load drops below a certain threshold. The rules must be set such that the total load of the n servers will never be more than n-1 servers could handle, because when a server fails, the others must be able to handle the load. As soon as this threshold is reached, a new instance will be started in order to lower the load per machine.
Now, whenever the load of the system drops below the peak load, the auto-scaling group can turn off servers to reduce cost. When the load increases again, the system simply starts a new server.
The thresholds for scaling up and down can be calculated using the following formula:
let S be the sum of the loads of all the running instances, let T be the maximum load that can be handled by all running instances together, and let N be the number of running instances.
upper-bound: if S > ((N-1) / N) * T then scale up
lower-bound: if S < ((N-2) / N) * T then scale down
The lower bound uses N-2 because after scaling down only N-1 instances remain, and those must still be able to lose one instance and carry the load.
If the load fluctuates around a boundary, the number of servers would keep going up and down. To prevent this from happening, a delay is introduced by the cloud hoster, but we should also lower the lower-bound by 5% or so to have some bandwidth in which the number of servers is stable. We can’t touch the upper bound, since it would violate the one server standby requirement.
Of course, down-scaling can only occur for N>2, as we need a minimum of 2 servers. Up-scaling could in principle continue without limit; we only introduce a soft limit of 5 servers to cap the maximum costs, and if business needs change, we could raise this limit.
From this it follows that the load on N machines (when we have more than 2) will never be below (N-2) / N of their capacity, ignoring the small stability margin. For example, the load on 3 machines will never be below 1/3 (33%), the load on 4 machines will never be below 1/2 (50%), and the load on 5 machines will never be below 3/5 (60%). The higher the load, the better the utilization becomes, and when the load drops, the extra machines are shut down to improve utilization again.
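As a sketch of how such a rule could be evaluated, the Python function below works with the capacity of a single instance. It scales up when the load no longer fits on N-1 instances, and scales down only when the load would still fit on N-2 instances, so that the remaining N-1 instances keep one machine in reserve. The function name, the parameter names and the 5% margin are assumptions for illustration; a real cloud provider has its own policy configuration for this.

```python
def scaling_decision(
    total_load: float,
    instances: int,
    capacity_per_instance: float,
    min_instances: int = 2,
    max_instances: int = 5,
    hysteresis: float = 0.05,
) -> str:
    """Return "scale_up", "scale_down" or "hold" for the current load."""
    # Scale up when the load no longer fits on N-1 instances.
    upper = (instances - 1) * capacity_per_instance
    # Scale down only when the load would still fit on N-2 instances,
    # lowered by a margin so the group does not flap around the boundary.
    lower = (instances - 2) * capacity_per_instance * (1 - hysteresis)

    if total_load > upper and instances < max_instances:
        return "scale_up"
    if total_load < lower and instances > min_instances:
        return "scale_down"
    return "hold"
```

For example, with instances that can each handle 1.0 unit of load, `scaling_decision(2.5, 3, 1.0)` returns `"scale_up"`, while `scaling_decision(1.5, 3, 1.0)` returns `"hold"`.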
The result will be that depending on the fluctuations in your landscape, your average utilization percentage will increase even more, resulting in a lower running cost of your landscape.
How to achieve this auto-scaling behavior will be addressed in an upcoming post.