Introduction

Do you know Zalando? They sell fashion online, and with €2.2 billion turnover in 2014, they are one of Europe's most successful startups. Go visit their shop and order something. Your order will be processed by Camunda. As you can imagine that is a lot of orders, and it would not be exactly "awesome" if their order execution suddenly crashed - scalability and resilience are big topics at Zalando, which is why they use Camunda.

There are many examples of organizations all over the world, using Camunda in environments that demand scalability and resilience. Some are running Camunda on premise, some do it in the cloud, and some do both.

Here you can find an overview about the concepts and features that make Camunda "scale like hell":

Persistence Strategies

Clustering

Runtime vs. History

Persistence Strategies

Camunda runs on many different relational databases (see: supported environments). Camunda uses those databases as efficient as possible, applying the following concepts:

Compact Tables

Camunda uses a compact data model and sophisticated algorithms to minimize the number of rows necessary to store the state of process instances in the database. This makes execution faster by reducing the number of rows that need to be stored. In the best case, only a single row needs to be updated when advancing from one activity to the next.

Deadlock Avoidance

Camunda uses optimistic concurrency control to support high levels of concurrency while minimizing the risk of dead locks. Locks are never held during user think time. All modifications to database state are batch flushed at the end of a transaction while using intelligent SQL statement ordering to avoid circular waits.

Control Savepoints

When reaching a savepoint, the in-memory state is synchronized with the database to ensure fault-tolerance. Camunda provides fine granular control over the placement of savepoints,and thus allows you to balance fault tolerance and performance. For example, you can batch the execution of multiple activities in a single transaction and by this reduce the number of database synchronization points.

Intelligent Caching

Write-only savepoints are supported through reusing the 1st level cache in subsequent transactions. This substantially reduces the number of SELECT statements necessary to execute a sequece of activities with intermediary savepoints. This is most effective when implementing data-heavy processes like JSON or XML payload transformation

True Concurrency

Concurrent tokens are represented as individual rows in the database. This model allows for true intra process instance concurrency since seperate rows can be updated concurrently.

Clustering

You can run Camunda in a Cluster in order to achieve load balancing and/or high availability. For example, it is a very common use case to run Camunda in an active/active configuration.

Multiple Nodes, Shared Database

In order to provide load balancing or fail-over capabilities, the process engine can be distributed to different nodes in a cluster. Each process engine instance will then connect to a shared database.

The individual process engine instances do not maintain session state across transactions. Whenever the process engine runs a transaction, the complete state is flushed out to the shared database. This makes it possible to route subsequent requests which do work in the same process instance to different cluster nodes. This model is very simple and easy to understand and imposes limited restrictions when it comes to deploying a cluster installation. As far as the process engine is concerned there is also no difference between setups for load balancing and setups for fail-over (as the process engine keeps no session state between transactions).

As a consequence, it is extremely easy to set up HA configurations such as active/active nodes.

The process engine job executor is also clustered and runs on each node. This way, there is no single point of failure as far as the process engine is concerned. The job executor can run in both homogeneous and heterogeneous clusters.

Minimal resource allocation

Because the process engine is stateless, it allocates a minimal amount of RAM resource per node (typically less than 10 MB). Basically, you can have billions of process instances persisted in the database, without a necessary impact on resource allocation per node. This also means that you can run many process engine instances per node.

Runtime vs. History

Runtime Data is the minimal amount of data the Camunda process engines needs to persist in order to do asynchronous continuation - e.g. when the process engine waits for a user interaction (user task), incoming message (message event) or timespan (timer event), or when there are asynchronous service interactions (service task).

History Data are not necessary for execution, but may be logged for the sake of audits, reporting etcetera. They allow you inspect the complete audit trail of running and completed process instances, including their payload.

Camunda separates Runtime Data from History Data, which is a very powerful architectural concept for performance optimization.

History Event Stream

In addition to maintaining the runtime state, the process engine creates an audit log providing audit information about executed process instances. We call this event stream the history event stream. The individual events which make up this event stream are called History Events and contain data about executed process instances, activity instances, changed process variables and so forth. In the default configuration, the process engine will simply write this event stream to the Camunda history database. The HistoryService API allows querying this database.

Configurable Log Level

The history level controls the amount of data the process engine provides via the history event stream. A number of settings are available out of the box, like with or without payload (process variables), or "full" or "none". However, you can also create your very own log level configuration. For example, you could decide that only for certain processes of a certain kind a certai amount of data should be logged, e.g. when orders are processed that have a volume of more than $100 USD.

Big Data

Alternatively, you can reroute the history event stream to any target you like, including queues and what ever big data solutions you want to use. This is possible because the Camunda Process Engine component does not read any state from the history database.