Log Data Archiving

In HAP Server, workflow execution data and worksheet logs are retained long-term, which can grow into a huge volume of data and occupy considerable space in the database. We therefore provide a log archiving solution: according to the rules you set, it archives the relevant logs to a new MongoDB instance and deletes the archived data from the source database. After deletion, the execution history of workflows (including approval workflows) and the worksheet logs for that time period are no longer displayed on the page.

First, deploy a new MongoDB instance. It is recommended that its version match the MongoDB version currently used by HAP. The built-in MongoDB in the standalone environment is v3.4.24 by default, and we provide a detailed document, Single Node MongoDB Deployment of v3.4.24. If you only need to delete the relevant data without archiving it, there is no need to prepare a new MongoDB instance.

In a standalone HAP Server environment, to ensure that the log archiving program can access MongoDB, you must first map the MongoDB port to the host. Never expose this port to the public network.

More details on how to access storage components externally

Configuration Steps:

  1. Download the image (Offline Package)

    docker pull nocoly/hap-archivetools:1.0.2
  2. Create a config.json configuration file with the following content:

    [
      {
        "id": "1",
        "text": "Archiving of workflows execution history",
        "start": "2023-01-01",
        "end": "2023-02-01",
        "src": "mongodb://root:password@192.168.1.20:27017/mdworkflow?authSource=admin",
        "archive": "mongodb://root:password@192.168.1.30:27017/mdworkflow-archive-2023271003?authSource=admin",
        "table": "wf_instance",
        "delete": true,
        "batchSize": 500,
        "retentionDays": 0
      },
      {
        "id": "2",
        "text": "Archiving of workflows execution history",
        "start": "2023-01-01",
        "end": "2023-02-01",
        "src": "mongodb://root:password@192.168.1.20:27017/mdworkflow?authSource=admin",
        "archive": "mongodb://root:password@192.168.1.30:27017/mdworkflow-archive-2023271003?authSource=admin",
        "table": "wf_subInstanceActivity",
        "delete": true,
        "batchSize": 500,
        "retentionDays": 0
      },
      {
        "id": "3",
        "text": "Archiving of worksheets logs",
        "start": "2023-01-01",
        "end": "2023-02-01",
        "src": "mongodb://root:password@192.168.1.20:27017/mdworksheetlog?authSource=admin",
        "archive": "mongodb://root:password@192.168.1.30:27017/mdworksheetlog-archive-2023271003?authSource=admin",
        "table": "wslog*",
        "delete": true,
        "batchSize": 500,
        "retentionDays": 0
      }
    ]

    Parameter Description:

    "id": "Service Identification ID",
    "text": "Description",
    "start": "Start date of the data being archived, UTC (if the value of retentionDays is greater than 0, the configuration is not valid)",
    "end": "End date of the data being archived, UTC (if the value of retentionDays is greater than 0, the configuration is not valid)",
    "src": "Address of source base",
    "archive": "Address of target base (if empty, it is not archived but only deleted according to the set rules)",
    "table": "Data table",
    "delete": "It is fixed to true. Clean up the data that has been archived in the source base when the current archiving is complete and the number of records is checked and correct.",
    "batchSize": "Number of single batch insertions and batch deletions",
    "retentionDays": "It defaults to 0. When it is greater than 0, it means that the data from X days ago is deleted, and the timed archiving mode is enabled, and the specified start and end dates are automatically invalidated, which is performed every 24h by default."

    Whitelist of data tables that can be cleaned:

    code_catch
    hooks_catch
    webhooks_catch
    app_multiple_catch
    wf_instance
    wf_subInstanceActivity
    wf_subInstanceCallback
    custom_apipackageapi_catch
    wslog* # * is a wildcard matching all tables whose names start with wslog
    • All data tables are from mdworkflow, except wslog*, which represents data tables from mdworksheetlog.

    • code_catch, hooks_catch, and webhooks_catch can be deleted by dropping them while the microservices are stopped, which directly frees the disk space occupied by those tables.
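Before mounting config.json into the container, it can be worth sanity-checking that the file is valid JSON and that every "table" value is on the whitelist above. A minimal sketch (the inline example config and the use of python3 for parsing are illustrative assumptions; point the checks at your real config.json):

```shell
#!/bin/sh
# Write a minimal example config (in practice, use your real config.json).
cat > config.json <<'EOF'
[
  {
    "id": "1",
    "text": "Archiving of workflows execution history",
    "start": "2023-01-01",
    "end": "2023-02-01",
    "src": "mongodb://root:password@192.168.1.20:27017/mdworkflow?authSource=admin",
    "archive": "",
    "table": "wf_instance",
    "delete": true,
    "batchSize": 500,
    "retentionDays": 0
  }
]
EOF

# 1) Validate the JSON syntax.
python3 -m json.tool config.json > /dev/null && echo "JSON OK"

# 2) Check every "table" value against the whitelist (wslog* is a glob).
for t in $(python3 -c 'import json; print("\n".join(e["table"] for e in json.load(open("config.json"))))'); do
  case "$t" in
    code_catch|hooks_catch|webhooks_catch|app_multiple_catch|wf_instance|wf_subInstanceActivity|wf_subInstanceCallback|custom_apipackageapi_catch|wslog*)
      echo "table $t: whitelisted" ;;
    *)
      echo "table $t: NOT on the whitelist" ;;
  esac
done
```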

  3. Start the archiving service by executing the following command in the directory where the config.json file is located:

    docker run -d -it -v $(pwd)/config.json:/usr/local/MDArchiveTools/config.json  -v /usr/share/zoneinfo/Etc/GMT-8:/etc/localtime nocoly/hap-archivetools:1.0.2
    • While the program is running, it puts resource pressure on the source and target databases as well as on the machine running the program, so it is recommended to run it during off-peak business hours.

    • After the service is started, the program execution log is output in the container log, and the container exits when program execution is complete (it does not exit in timed archiving mode).

    • You can find the running container with docker ps -a and check the program execution logs with docker logs <container ID>.

    • The container runs in the background by default. If you want to watch the progress of archiving or deletion, remove the -d parameter from the docker run command; the logs are then output in the foreground and you can see the progress bar. Make sure the foreground process is not interrupted while the program is running.

    • In the example config.json configuration file, the target database is named in the format <source database name>-archive-<date-time>. Change the target database name each time you run the tool.

      • This is because, after archiving completes, the tool checks the record count in the target table and will not delete the source data if it does not match the source table. If you reuse the same target database name on a second run, the target table may end up with more records than were archived in that run, and the source data will not be deleted.
    • In timed archiving mode, you can set the ENV_ARCHIVE_INTERVAL environment variable to customize the execution interval, in milliseconds; the default value is 86400000.
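To avoid reusing a target database name, one option is to generate it from the current date and time when preparing config.json. A sketch, assuming the <source database name>-archive-<date-time> naming convention described above (the exact timestamp format is an illustrative choice):

```shell
#!/bin/sh
# Build a unique target database name such as mdworkflow-archive-202301151030.
ARCHIVE_DB="mdworkflow-archive-$(date +%Y%m%d%H%M)"
echo "$ARCHIVE_DB"
# The name can then be substituted into the "archive" connection string, e.g.:
echo "mongodb://root:password@192.168.1.30:27017/${ARCHIVE_DB}?authSource=admin"
```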