Alert Rules

Configuration Reminder

Configuring alerts involves several aspects, including monitoring data queries, alert logic settings, and notification policy management, so it is recommended that experienced Ops personnel handle this task. Well-designed alert rules help the team identify and respond to system anomalies in a timely manner, while poor configurations can produce excessive noise, false positives, and missed alerts.

Before configuring alert rules, please read this help document carefully to understand each configuration item and its best practices. It is generally recommended to read the entire document two or three times to become familiar with the process and key settings before starting. For complex monitoring scenarios, validate the rules in a test environment first and confirm they behave as expected before applying them to production, to ensure the accuracy and reliability of monitoring alerts.

Configuration Item Details

Below is a detailed explanation of each configuration item in the alert rules, their functions, meanings, and best practice recommendations to help users configure and manage alert rules more efficiently.


1. Enter alert rule name

  • Description: Enter a unique and descriptive name to identify the alert rule.
  • Recommendation: Use a readable name that reflects the monitoring object or business scenario, such as prod_cpu_high_usage or Production Environment Disk High Usage Rate.

2. Define query and alert condition

Alert rules generally consist of three steps; each step processes the output of the previous one until the rule determines whether to trigger an alert:

2.1 Step A – Query raw data

  • Action: Use PromQL to query time series data from the data source.
  • Recommendation: Ensure the query statement is accurate and returns the required key metric data.

2.2 Step B – Reduce

  • Function: Reduce the time series data returned in Step A to a single value.
  • Configuration Items:
    • Input: Reference the query result from Step A.
    • Function: Select Last, which takes the latest data value.
    • Mode: Use Strict mode, which requires valid input data; otherwise, no calculation is performed.
  • Recommendation: Choose the appropriate reduction function (e.g., Last, Avg, Max) to ensure that the calculation result accurately reflects the current state.

2.3 Step C – Threshold

  • Function: Determine whether the value obtained in Step B meets the conditions to trigger an alert.
  • Configuration Items:
    • Input: Reference the reduced result from Step B.
    • Condition: For example, trigger an alert when the value “Is above” a set threshold.
    • Optional: Set a custom recovery threshold to prevent alert flapping.
  • Recommendation: Set the threshold reasonably based on the characteristics of the monitoring object and historical data to avoid false positives or missed alerts.
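The three-step pipeline above can be sketched in plain Python. This is a simplified illustration, not Grafana's actual implementation; the sample values, the threshold of 90, and the recovery threshold of 85 are assumptions for the example:

```python
# Minimal sketch of the Query -> Reduce -> Threshold pipeline.
# Sample data, threshold (90), and recovery threshold (85) are
# illustrative assumptions, not values taken from Grafana itself.

def reduce_last(series):
    """Step B: reduce a time series to its latest value (Last, Strict mode)."""
    if not series:  # Strict mode: no valid input -> no result
        return None
    return series[-1]

def evaluate(series, firing, threshold=90, recovery=85):
    """Step C: an 'Is above' condition with an optional lower recovery
    threshold, so the alert does not flap around the trigger value."""
    value = reduce_last(series)
    if value is None:
        return firing  # no data: keep the previous state in this sketch
    if firing:
        return value > recovery   # stay firing until we drop below recovery
    return value > threshold      # start firing only above the main threshold

state = evaluate([85, 88, 91], firing=False)   # 91 > 90 -> fires
state = evaluate([87], firing=state)           # 87 > 85 -> still firing
state = evaluate([84], firing=state)           # 84 <= 85 -> resolved
```

The separate recovery threshold is what prevents a metric hovering near 90 from repeatedly firing and resolving.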

3. Set evaluation behavior

3.1 Folder

  • Function: Logically categorize alert rules for easy management across different businesses or teams.
  • Configuration Items:
    • Choose: Select from existing folders, such as General.
    • New folder: Create a new folder with names based on the business, environment (production/test), or team, such as Prod.
  • Recommendation: Use standardized naming rules to facilitate permission management and maintenance.

3.2 Evaluation Group and Interval

  • Function: Define the evaluation frequency of the alert rules and resource allocation strategy to balance system performance and real-time requirements.
  • Key Concepts:
    • Evaluation group and interval: A set of alert rules sharing the same evaluation interval for unified scheduling and optimized resource usage.
    • New evaluation group: Create a new group with names based on business scenarios, the criticality of alert rules, and evaluation intervals for ease of understanding and maintenance.
      • Here are some simple evaluation group naming examples based on priority:
        • Critical-1m
        • High-5m
        • Standard-15m
  • Recommendation:
    • For critical metrics with high real-time requirements (such as CPU usage, memory utilization), use shorter intervals (e.g., 1m).
    • For metrics with lower real-time requirements, use longer intervals (e.g., 15m) to reduce resource consumption.

3.3 Pending Period

  • Function: Require the alert condition to persist for a certain period before triggering an alert to prevent false positives due to transient fluctuations.
  • Configuration Options:
    • None: Trigger the alert immediately once the condition is met, suitable for scenarios with very low latency requirements.
    • 1m/2m/.../5m: The condition must persist for the specified time before triggering an alert.
  • Recommendation:
    • For metrics easily affected by transient fluctuations (such as network jitter), set a pending period of 1-2 minutes.
    • For critical system metrics, balance latency and accuracy to choose an appropriate time window.
    • Generally, keep the default of 1 minute unless there are specific requirements.
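The effect of a pending period can be modeled as requiring the condition to hold for several consecutive evaluations before the alert fires. The sketch below is a simplified model, assuming a 1m evaluation interval and a 2m pending period:

```python
def fires(evaluations, pending_evals):
    """Return True once the condition has held for `pending_evals`
    consecutive evaluations (pending period / evaluation interval).
    `evaluations` is a list of booleans, one per evaluation tick."""
    streak = 0
    for breached in evaluations:
        streak = streak + 1 if breached else 0  # any recovery resets the count
        if streak >= pending_evals:
            return True
    return False

# With a 1m interval and a 2m pending period (pending_evals=2),
# transient spikes do not fire, but a sustained breach does.
print(fires([True, False, True], 2))   # False
print(fires([False, True, True], 2))   # True
```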

4. Configure labels and notifications

4.1 Labels

  • Function:
    • Add metadata to alerts (e.g., severity=high, team=ops) for subsequent search, filter, and alert routing.
  • Recommendation (if no special requirements, omit configuration):
    • Establish a uniform label specification to ensure consistency across different rules.
    • Use labels to classify alerts by business, urgency, etc.

4.2 Notifications

  • Function: Define the method and channel for receiving notifications when an alert is triggered.
  • Key Configuration Items:
    • Contact Point: Select or configure the specific notification channel (e.g., Email, DingDing).
    • Mute Timings, Grouping, and Override Timings: Precisely control notification suppression, grouping, and sending frequency.
  • Recommendation:
    • Regularly check and update notification channels to ensure the responsible personnel receive alerts in a timely manner.
    • Use grouping and suppression strategies to avoid a flood of duplicate notifications due to the same issue.
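Grouping can be thought of as bucketing firing alerts by a subset of their labels so that one notification covers many related alerts. The sketch below is a simplified illustration; the label names team and severity follow the example in section 4.1, and the alert data is made up:

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the given label keys, as notification grouping does:
    alerts sharing the same values for those labels go into one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

# Hypothetical firing alerts, labeled as recommended in section 4.1.
alerts = [
    {"alertname": "HighCPU",  "team": "ops", "severity": "high"},
    {"alertname": "HighMem",  "team": "ops", "severity": "high"},
    {"alertname": "DiskFull", "team": "db",  "severity": "high"},
]
groups = group_alerts(alerts, ["team", "severity"])
# Two notifications instead of three: ("ops", "high") and ("db", "high").
```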

5. Configure notification message

5.1 Summary (Optional)

  • Function: Summarize the core issue of the alert in one sentence for quick identification and response.
  • Recommendation: Keep it concise and highlight the key issue.

5.2 Description (Optional)

  • Function: Provide a detailed description of the monitoring object, detection logic, and potential impact range of the alert rule for subsequent fault diagnosis.
  • Recommendation: Include contextual information, historical trends, or suggested initial troubleshooting steps.

5.3 Runbook URL (Optional)

  • Function: Link to detailed emergency response document guiding the fault handling process.
  • Recommendation: Ensure the link is valid and the runbook content is up-to-date.

5.4 Add Custom Annotation

  • Function: Extend the context information of alert messages with additional key-value pairs, such as responsible person, service name, etc.
  • Recommendation: Add necessary annotations as per actual needs to facilitate subsequent automated processing and archiving.

5.5 Link Dashboard and Panel

  • Function: Provide direct links to related monitoring dashboards or panels in the alert notification to speed up fault localization.
  • Recommendation: Ensure the links are up-to-date and correspond to the actual configuration in Grafana.

Configuration Example

Below is a simple and practical configuration example for High Disk Usage, helping users quickly get started with alert rule configuration.


1. Enter alert rule name

  • name: High Disk Usage

2. Define query and alert condition

  • PromQL:

    100 * ((node_filesystem_size_bytes{fstype=~"xfs|ext4", mountpoint=~"/|/data"} - node_filesystem_avail_bytes{fstype=~"xfs|ext4", mountpoint=~"/|/data"}) / node_filesystem_size_bytes{fstype=~"xfs|ext4", mountpoint=~"/|/data"})
    • This PromQL computes the disk usage percentage for the / and /data mount points.
  • Alert Threshold: 90
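The formula behind the query, 100 × (size − available) / size, can be checked with concrete numbers. The 100 GiB filesystem with 8 GiB available is a made-up figure for illustration:

```python
def disk_usage_percent(size_bytes, avail_bytes):
    """Mirror of the PromQL expression: 100 * ((size - avail) / size)."""
    return 100 * (size_bytes - avail_bytes) / size_bytes

# Hypothetical 100 GiB filesystem with 8 GiB available -> 92% used,
# which is above the threshold of 90, so the rule would fire.
usage = disk_usage_percent(100 * 2**30, 8 * 2**30)
print(round(usage, 2))  # 92.0
```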

3. Set evaluation behavior

  • Folder: Select a folder or create one if not available, such as Prod.

  • Evaluation Group and Interval: Choose a group or create one if not available, for example:

    • Evaluation group name: Standard-5m.
    • Evaluation interval: 5m.
  • Pending period: Set the pending period to a value no lower than the evaluation interval, such as 5m.

4. Configure labels and notifications

  • Contact point: Select the alert receiving channel (notification methods must be configured in advance).

5. Configure notification message

  • Summary: Set a one-sentence summary:

    Disk Device: {{ $labels.device }}, Mount Point: {{ $labels.mountpoint }}, Current Usage Rate: {{ printf "%.2f" $values.B.Value }} %

6. After completing the configuration, click Save rule and exit in the top-right corner.

Common PromQL

High CPU Usage

PromQL:

100 * (1 - avg(irate(node_cpu_seconds_total{mode="idle"}[2m])) by (instance))
  • Threshold Example: 90

Summary:

Current CPU Usage Rate: {{ printf "%.2f" $values.B.Value }} %
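The CPU expression above measures the idle fraction of CPU time per instance and subtracts it from 1; the arithmetic can be checked directly. The 0.15 idle rate below is an assumed figure:

```python
def cpu_usage_percent(idle_fraction):
    """Mirror of 100 * (1 - avg(irate(...mode="idle"...))): the average
    idle time per second of wall clock, inverted into a usage percentage."""
    return 100 * (1 - idle_fraction)

# If the CPU spends 15% of its time idle, usage is 85% -> below the
# example threshold of 90, so no alert would fire.
print(round(cpu_usage_percent(0.15), 2))  # 85.0
```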

High Memory Usage

PromQL:

100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
  • Threshold Example: 90

Summary:

Current Memory Usage Rate: {{ printf "%.2f" $values.B.Value }} %

High Disk Usage

PromQL:

100 * ((node_filesystem_size_bytes{fstype=~"xfs|ext4", mountpoint=~"/|/data"} - node_filesystem_avail_bytes{fstype=~"xfs|ext4", mountpoint=~"/|/data"}) / node_filesystem_size_bytes{fstype=~"xfs|ext4", mountpoint=~"/|/data"})
  • Threshold Example: 90

Summary:

Disk Device: {{ $labels.device }}, Mount Point: {{ $labels.mountpoint }}, Current Usage Rate: {{ printf "%.2f" $values.B.Value }} %

Kafka Consumer Group Lag

PromQL:

sum(kafka_consumergroup_lag{}) by (consumergroup, topic)
  • Threshold Example: 10000

Summary:

Consumer Group: {{ $labels.consumergroup }}, Topic: {{ $labels.topic }}, Current Lag: {{ printf "%.2f" $values.B.Value }}