Continuous Queuing of Workflows

When a workflow event is triggered, it enters the Kafka message queue as a message and is then consumed and processed by the workflow consumption service.

When workflows keep queuing up, first determine whether the queued workflows are still being consumed, or whether they are queuing without being consumed at all.

Consider the following two scenarios:

Workflow queued and consumed

Check the workflow monitoring page to see whether any workflows have a very large queue, such as tens of thousands or even millions of queued instances. This is usually caused by trigger logic misconfigured by business users, or by loops that generate a high volume of workflow triggers.

Consumption capacity is limited. For example, if the system can normally process ten thousand workflows per minute but hundreds of thousands of workflows are suddenly triggered, a large backlog will build up that cannot be consumed quickly.

For example, when a few workflows each have tens of thousands or more queued instances and are causing severe queuing for all other workflows, and it is confirmed that these heavily queued workflows do not need to be processed again, close them directly while they are not paused.

  • When a workflow is closed directly, its queued instances are consumed quickly (without executing the node logic in the workflow), so the backlog is cleared as soon as possible.

  • If the heavily queued workflows still need to be consumed, pause them first and restart them during off-peak business hours.


If the total number of queued workflows is large but no single workflow has a large queue, the cause may be one of the following:

  • Microservice instances with high resource usage. If the services are deployed in cluster mode on Kubernetes and there is spare capacity in the cluster, you can dynamically scale out the resource-intensive service instances (see the sketch after this list).

  • Slow queries in the MongoDB database, typically indicated by high CPU usage in the MongoDB process and numerous slow-query entries in mongodb.log. Go to slow query optimization for more details.
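
As a minimal sketch of both checks, assuming the services run as Kubernetes Deployments in a namespace named hap and that the MongoDB log is written to /data/mongodb/log/mongodb.log (the namespace, deployment name, and log path are assumptions; replace them with the actual values in your environment):

    # Check current pod resource usage, then scale out the workflow consumer
    # ("hap" and "workflow-consumer" are assumed names)
    kubectl -n hap top pods
    kubectl -n hap scale deployment workflow-consumer --replicas=4

    # Count slow-query entries in the MongoDB log
    # (path and message text depend on your MongoDB version)
    grep -c "Slow query" /data/mongodb/log/mongodb.log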

Both of the above issues can also lead to rebalancing of Kafka consumer groups: slow processing causes consumer timeouts, or consumer service instances restart due to resource pressure, either of which triggers a consumer group rebalance.

You can use the following commands to check the actual message backlog in each topic partition and whether the consumer group is currently rebalancing:

  1. Enter the container of the storage component

    docker exec -it $(docker ps | grep hap-sc | awk '{print $1}') bash
  2. Execute the command to view the consumption status of the md-workflow-consumer consumer group

    /usr/local/kafka/bin/kafka-consumer-groups.sh --bootstrap-server ${ENV_KAFKA_ENDPOINTS:=127.0.0.1:9092} --describe --group md-workflow-consumer
    • If prompted with Error: Executing consumer group command failed due to null, you can click to download the Kafka installation package, upload it to the deployment server, and copy it into the hap-sc container. Then unzip it and run the above command using bin/kafka-consumer-groups.sh from the new package (see the sketch below).
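
A minimal sketch of that workaround; the file name kafka_2.13-3.x.x.tgz is a placeholder for whichever package you actually download:

    # On the deployment server: copy the downloaded package into the hap-sc container
    docker cp kafka_2.13-3.x.x.tgz $(docker ps | grep hap-sc | awk '{print $1}'):/tmp/

    # Inside the container: unpack it and run the same command with the new binaries
    docker exec -it $(docker ps | grep hap-sc | awk '{print $1}') bash
    cd /tmp && tar -xzf kafka_2.13-3.x.x.tgz
    /tmp/kafka_2.13-3.x.x/bin/kafka-consumer-groups.sh --bootstrap-server ${ENV_KAFKA_ENDPOINTS:=127.0.0.1:9092} --describe --group md-workflow-consumer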

The normal output is as follows. If prompted with Warning: Consumer group 'md-workflow-consumer' is rebalancing., it means the consumer group is currently rebalancing.
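
An illustrative sample of a healthy describe output (the offsets, lag values, and consumer IDs below are made-up; only the column layout matters):

    GROUP                 TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID      HOST        CLIENT-ID
    md-workflow-consumer  WorkFlow        0          1837462         1837480         18   consumer-1-xxxx  /10.0.0.12  consumer-1
    md-workflow-consumer  WorkFlow-Batch  0          52310           52310           0    consumer-1-xxxx  /10.0.0.12  consumer-1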

  • The LAG column represents the current message backlog of that Topic partition (a sketch for totaling LAG per topic follows the list of topic names below).

  • Commonly used Topic names in workflows:

    • WorkFlow: Main workflow execution

    • WorkFlow-Process: Sub-workflow execution

    • WorkFlow-Router: Slow queue for workflow execution

    • WorkFlow-Batch: Bulk workflow execution

    • WorkFlow-Button: Button-triggered workflow execution

    • WorkSheet: Row record validation for triggering workflows

    • WorkSheet-Router: Slow queue for row record validation for triggering workflows
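
To get a quick per-topic total of the LAG column, you can pipe the same describe command through awk; this is a convenience sketch, not part of the product tooling:

    /usr/local/kafka/bin/kafka-consumer-groups.sh --bootstrap-server ${ENV_KAFKA_ENDPOINTS:=127.0.0.1:9092} --describe --group md-workflow-consumer \
      | awk 'NR>1 && $6 ~ /^[0-9]+$/ {lag[$2]+=$6} END {for (t in lag) printf "%-20s %d\n", t, lag[t]}'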

Workflow queued but not consumed

When workflows keep queuing but are not being consumed at all, the cause is usually a full server disk or a problem with the Kafka service.
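
Before digging into Kafka itself, it is worth ruling out a full disk, for example:

    # Check overall disk usage on the deployment server
    df -h

    # Check the size of the Kafka data directory (the path is an assumption;
    # adjust it to the volume actually mounted into the hap-sc container)
    du -sh /data/kafka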

Check if the Kafka service is running properly:

Check the health check logs of the storage component container.

docker logs $(docker ps | grep hap-sc | awk '{print $1}')
  • If the service is healthy, the log output consists only of INFO entries (you can filter out the INFO lines to spot anything abnormal, as shown below). If the Kafka service keeps restarting, it is in an abnormal state: try restarting the whole service first, and if Kafka still cannot start, clear the Kafka error data.
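
A quick way to surface non-INFO entries from the recent logs, using only standard docker and grep options:

    docker logs --since 30m $(docker ps | grep hap-sc | awk '{print $1}') 2>&1 | grep -vi "INFO"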

If triggered workflows could not be written to the Kafka queue because of a full disk or a Kafka service failure, those historical "queued" workflows will not be consumed later, since they were never actually enqueued.