Service Health Check
Standalone Mode
When some functions behave abnormally or the system cannot be accessed at all, follow the steps below to troubleshoot in order.
Container Service Log Check
Check Health Check Logs of Microservice Application Containers
docker logs $(docker ps -a | grep -E 'hap-community' | awk '{print $1}')
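If the full log output is too long to scan, a general Docker pattern (not specific to HAP) is to show only the most recent lines and filter for error-level entries, for example:
docker logs --tail 200 $(docker ps -a | grep -E 'hap-community' | awk '{print $1}') 2>&1 | grep -iE 'error|exception'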
Check Health Check Logs of Storage Component Containers
docker logs $(docker ps -a | grep hap-sc | awk '{print $1}')
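Before reading the logs in detail, it can also help to confirm whether any container keeps restarting or has exited. The following is a generic docker ps sketch that assumes the container names contain "hap", as in the commands above:
docker ps -a --filter name=hap --format 'table {{.Names}}\t{{.Status}}'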
Log Analysis and Troubleshooting Methods:
- Normal: Logs are mainly at the INFO level and update steadily in a rolling manner.
- Abnormal: Continuous ERROR logs or stack trace information indicates issues requiring targeted analysis.
- Kafka Error: If storage component logs indicate that Kafka failed to start, refer to the Kafka Startup Failure Troubleshooting Steps.
- MongoDB Error: If the logs show that MongoDB has been restarting automatically, the cause is typically server memory overload (see the OOM check sketched after this list); a temporary restart of the HAP service can often resolve the issue.
- Microservice Error: If storage component logs are normal but microservice application logs are abnormal, attempt to resolve the issue by restarting the HAP service.
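For the MongoDB case above, one way to confirm that memory overload was the trigger is to check whether the kernel OOM killer terminated the mongod process. This is a generic Linux check rather than an HAP-specific command:
dmesg -T | grep -iE 'out of memory|killed process'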
Restarting the HAP Service
Run the following command in the directory where the installation manager is extracted:
bash service.sh restartall
- If the path of the service.sh file is forgotten, use the command below to locate it:
find / -path /proc -prune -o -name "service.sh" -print
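If you want to locate the file and restart in a single step, the two commands can be combined. This is only a convenience sketch; it assumes the first service.sh that find returns belongs to the installation manager:
cd "$(dirname "$(find / -path /proc -prune -o -name 'service.sh' -print | head -n 1)")" && bash service.sh restartall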
Server Physical Resource Check
CPU Usage Check
top -c
- In standalone mode, a 16-core CPU is sufficient for most scenarios. If the CPU is still fully utilized and the process with the highest CPU usage is the mongod process, this is usually caused by slow queries; refer to the Slow Query Optimization Documentation. A batch-mode snapshot command for capturing the heaviest CPU consumers is sketched after this list.
- The wa field in the CPU metrics of the top command shows the percentage of CPU time spent waiting for disk I/O. Normally it should be 0 or 0.x; if it reaches 5 or higher, disk performance is poor and switching to SSD disks is strongly recommended.
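To capture a one-off snapshot of the heaviest CPU consumers (for example, to attach to a support ticket), top can be run in batch mode. This is standard procps behavior, not an HAP-specific tool; by default the output is sorted by CPU usage:
top -b -n 1 | head -n 20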
Memory Usage Check
free -h
- When memory usage is close to capacity, it can easily trigger system anomalies, which may also lead to abnormally high CPU usage.
- If memory utilization is excessively high even in environments with 64 GB or more of memory, use the top -co %MEM command to sort processes by memory usage percentage and identify the problematic processes.
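As an alternative to sorting inside top, a static snapshot with ps is easier to copy out; this is a standard procps command, not specific to HAP:
ps aux --sort=-%mem | head -n 10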
Disk Space Check
df -Th
- A full data partition will cause system functions to become unavailable.
- Refer to documentation for cleaning up old images, deleting redundant log data, or expanding disk space. Afterward, restart the service to restore functionality.
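Before cleaning up, it helps to see where the space is actually going. The sketch below lists the largest top-level directories on a mount point and removes dangling Docker images; /data is only an example path and should be replaced with your actual data partition, and any pruning should be reviewed before running in production:
du -xh --max-depth=1 /data 2>/dev/null | sort -rh | head -n 10
docker image prune -f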
Historical Resource Usage Trend Check
System anomalies sometimes appear only after a delay, so it is necessary to review historical resource trends for retrospective analysis.
- Existing Monitoring: If monitoring tools (such as Zabbix or Prometheus) are already installed on the server, prioritize reviewing trends in CPU, memory, and I/O metrics during the fault timeframe.
- Additional Installation: If no monitoring mechanism is currently in place, it is recommended to install an Ops Platform to provide real-time monitoring and historical analysis of system resources.
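If a full monitoring platform is not available but the sysstat package happens to be installed, sar can give a rough historical view. The log path below follows the common Red Hat-style layout and may differ on your distribution; the file name sa12 is only an example for the 12th day of the month:
sar -u    # CPU usage history for the current day
sar -r    # memory usage history for the current day
sar -u -f /var/log/sa/sa12    # CPU history for a specific day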