Service Health Check

Standalone Mode

When some features malfunction or the system is inaccessible as a whole, follow the steps below to troubleshoot in sequence.

Container Service Log Check

Check Health Check Logs of Microservice Application Containers

docker logs $(docker ps -a | grep -E 'hap-community' | awk '{print $1}')

Check Health Check Logs of Storage Component Containers

docker logs $(docker ps -a | grep hap-sc | awk '{print $1}')
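Note that the command substitution above assumes exactly one matching container; `docker logs` fails if it expands to several IDs. A loop handles multiple matches. The sketch below simulates the `docker ps` output so it can run anywhere; on the server, replace the simulated variable with the real command shown in the comment.

```shell
# Simulated `docker ps -a` output (two hypothetical containers); on the
# real host use: docker ps -a --format '{{.ID}} {{.Image}}'
ps_output='abc123 hap-community:latest
def456 hap-sc:latest'

# Fetch logs per container so multiple matches do not break `docker logs`
for id in $(printf '%s\n' "$ps_output" | awk '/hap/ {print $1}'); do
  echo "docker logs --tail 200 $id"   # echo previews the command; drop echo to run it
done
```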

Log Analysis and Troubleshooting Methods:

  • Normal: Logs are mainly at the INFO level and update steadily in a rolling manner.

  • Abnormal: Continuous ERROR logs or stack trace information indicates issues requiring targeted analysis.

    • Kafka Error: If storage component logs indicate that Kafka failed to start, refer to the Kafka Startup Failure Troubleshooting Steps.

    • MongoDB Error: If the logs show that MongoDB has restarted automatically, the cause is typically server memory exhaustion. Restarting the HAP service usually provides temporary relief, but the underlying memory pressure should still be addressed.

    • Microservice Error: If storage component logs are normal but microservice application logs are abnormal, attempt to resolve the issue by restarting the HAP service.
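As a quick triage step, the rolling log output can be reduced to an error count: a count that keeps growing between runs signals an active fault rather than a one-off blip. The sketch below uses a simulated log file so it is safe to run anywhere; on the server, pipe the `docker logs` output from the commands above into the same `grep`.

```shell
# Simulate a short log excerpt; in practice pipe `docker logs ... 2>&1` in
printf 'INFO service up\nERROR connect to kafka failed\nINFO retrying\n' > /tmp/hap-sample.log

# Count ERROR/FATAL lines in the sample
grep -cE 'ERROR|FATAL' /tmp/hap-sample.log   # → 1
```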

Restarting the HAP Service

Run the following command in the directory where the installation manager is extracted:

bash service.sh restartall

  • If you no longer remember where service.sh is located, find it with:

    find / -path /proc -prune -o -name "service.sh" -print
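The locate-and-restart steps can be combined into a single pattern. The sketch below exercises it against a dummy `service.sh` created under /tmp so it is safe to run anywhere; on a real host, search from `/` with the `-path /proc -prune` form shown above instead.

```shell
# Create a stand-in service.sh for demonstration purposes only
mkdir -p /tmp/hap-demo
printf '#!/bin/sh\necho "service action: $1"\n' > /tmp/hap-demo/service.sh

# Find the script, cd into its directory, and invoke it (-quit stops at
# the first match); on the server, search from / instead of /tmp/hap-demo
dir=$(dirname "$(find /tmp/hap-demo -name service.sh -print -quit)")
(cd "$dir" && bash service.sh restartall)   # → service action: restartall
```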

Server Physical Resource Check

CPU Usage Check

top -c

  • In standalone mode, a 16-core CPU is sufficient for most scenarios. If a CPU of that size is nevertheless fully utilized and the process consuming the most CPU is mongod, the cause is usually slow queries; refer to the Slow Query Optimization Documentation.

  • The wa field in the CPU line of top output is the iowait percentage, i.e. the share of time the CPU spends waiting on disk I/O. Normally it should be 0 or a fraction below 1; a sustained value of 5 or higher indicates poor disk performance, and switching to SSD disks is strongly recommended.
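The formatting of top's CPU line varies by version and locale, so for scripting or a quick one-shot reading, the same iowait share can be computed directly from /proc/stat on Linux (a sketch; field 6 of the aggregate cpu line holds iowait jiffies):

```shell
# Aggregate iowait share since boot, read from /proc/stat:
# cpu user nice system idle iowait irq softirq ...  -> $6 is iowait
awk '/^cpu /{t=0; for(i=2;i<=NF;i++) t+=$i; printf "iowait: %.1f%%\n", 100*$6/t}' /proc/stat
```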

Memory Usage Check

free -h

  • When memory usage approaches capacity, system anomalies are easily triggered, and these can in turn drive CPU usage abnormally high.
  • If memory utilization is excessively high even on hosts with 64GB or more of memory, use the top -co %MEM command to sort processes by memory usage percentage and identify the problematic ones.
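For a non-interactive snapshot of the same ranking, `ps` can sort by memory directly (GNU/procps syntax, assumed here):

```shell
# Top five memory consumers, highest first; column 4 (%MEM) is the
# resident-memory share of each process
ps aux --sort=-%mem | head -n 6
```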

Disk Space Check

df -Th

  • A full data partition will render system functionality unavailable.
  • Refer to the documentation for cleaning up old images, deleting redundant log data, or expanding disk space; afterward, restart the service to restore functionality.
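Before cleaning up, it helps to locate the largest directories under the full mount point. A sketch, using `/var` as a stand-in for the affected data partition:

```shell
# Largest top-level directories under /var, biggest first; -x stays on
# one filesystem, 2>/dev/null hides permission-denied noise
du -xsh /var/* 2>/dev/null | sort -rh | head -n 5
```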

Historical Resource Usage Trend Check

The visible symptoms of a fault sometimes lag behind their root cause, so it is necessary to review historical resource usage trends for retrospective analysis.

  • Existing Monitoring: If monitoring tools (e.g., Zabbix or Prometheus) are already installed on the server, prioritize reviewing trends in CPU, memory, and I/O metrics during the fault timeframe.

  • Additional Installation: If no monitoring mechanisms are currently in place, it is recommended to install an Ops Platform to achieve real-time monitoring and historical data analysis of system resources.
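If no monitoring stack is present but the sysstat package happens to be installed, `sar` already records rolling history that can answer the same questions retroactively. A sketch (the `/var/log/sa` layout is the common default but varies by distribution):

```shell
# CPU history for today, including the iowait (%iowait) column; falls
# back to a notice when sysstat is absent or has collected no data yet
sar -u -f "/var/log/sa/sa$(date +%d)" 2>/dev/null \
  || echo "sysstat not installed or no data collected for today"
```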