The web interface of the self-diagnostics service is available for monitoring the system status and analyzing its performance.
Using the service, you can:
Metrics can be displayed:
To go to the monitoring interface:
The interface allows you to run queries against metrics and analyze their values.
To run a query:
Note
To view the metrics available in the Enter expression field, click the button → Explore metrics.
Complex queries can be executed using PromQL.
The basic options for executing queries are listed in the table:
| Options | Description |
|---|---|
| Usage of several metrics | You can use several metrics in one query |
| Filter by parameters | You can filter metrics by parameters (labels) using curly brackets. For example: `ngp_fps{ep_name=~"hosts/TEST/DeviceIpint.2/SourceEndpoint.video:0:0"}`. In this case, FPS values are displayed only for the specified source |
| Usage of logical and arithmetic operators to find anomalies | In queries, you can use logical and arithmetic operators. For example: `ngp_fps < 17`. This query finds sources with a frame rate below 17 FPS. For a full list of logical and arithmetic operators, see the official Prometheus documentation |
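
The filter and comparison options from the table can also be combined programmatically. The following is a minimal sketch, assuming the metrics are served by a Prometheus-compatible server that exposes the standard instant-query endpoint `/api/v1/query`. The base URL `http://127.0.0.1:9090` and the helper names `build_selector` and `build_query_url` are hypothetical and are not part of this documentation.

```python
from urllib.parse import urlencode

# Hypothetical address: replace it with the address of your monitoring server.
BASE_URL = "http://127.0.0.1:9090"


def escape_label_value(value: str) -> str:
    """Escape backslashes and double quotes inside a PromQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_selector(metric: str, labels: dict, op: str = "=") -> str:
    """Build a selector such as metric{label="value"}.

    op is a PromQL label matcher: =, !=, =~ or !~.
    """
    if op not in ("=", "!=", "=~", "!~"):
        raise ValueError("unsupported label matcher: " + op)
    if not labels:
        return metric
    parts = [
        name + op + '"' + escape_label_value(value) + '"'
        for name, value in labels.items()
    ]
    return metric + "{" + ",".join(parts) + "}"


def build_query_url(expr: str, base_url: str = BASE_URL) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return base_url + "/api/v1/query?" + urlencode({"query": expr})


# Filter by source, then keep only samples below 17 FPS.
selector = build_selector(
    "ngp_fps",
    {"ep_name": "hosts/TEST/DeviceIpint.2/SourceEndpoint.video:0:0"},
    op="=~",
)
expression = selector + " < 17"
url = build_query_url(expression)
print(expression)
print(url)
```

Building the selector from a dictionary keeps label values escaped consistently, which matters when an endpoint name contains quotes or backslashes.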
You can view query results in two modes:
Below are the main metrics available in the self-diagnostics service.
| Metric | Description |
|---|---|
| Metrics of system status | |
| ngp_cpu_total_usage | The CPU load of the server |
| Archive metrics | |
| ngp_archive_channel_fps | The frame rate of all cameras when recording to the archive |
| ngp_archive_volume_size | The current total size of the archive (in bytes) |
| Metrics of cameras and video analytics | |
| ngp_fps | The frame rate of all cameras, detectors, and decoders |
| ngp_people_count | The last number of people in the frame captured by the Crowd estimation VA detector |
| ngp_errors | Number of errors in the operation of detectors: |
| ngp_skipped_pp | Number of frames skipped by the Crowd estimation VA detector due to a lack of processing resources |
| Metrics of system status | |
| ALERTS_FOR_STATE | Found and fixed malfunctions. Contains the alertname parameter with the problem type. Example: `ALERTS_FOR_STATE{alertname="ipint_is_not_activated",ep_name="hosts/Server1/DeviceIpint.99",instance="127.0.0.1:20108",job="ngp_exporter",ngp_alert="true"}`. Explanation of the alertname values (see General information about the self-diagnostics service) for the ALERTS_FOR_STATE metric: |
| Metrics of disk status (SMART) | |
| smartctl_device_smart_status | General disk status. The main metric values: In such cases, we recommend checking: |
| smartctl_device_attribute | Contains detailed SMART attributes of disks. There are several value types: Example of interpretation: When the smartctl_device_attribute metric is analyzed, the attribute values can look like this: Usage in monitoring: |
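
The `alertname` and `ep_name` labels of `ALERTS_FOR_STATE` can be used to group active problems by type. The following is a minimal sketch, assuming the query result arrives in the standard Prometheus instant-query JSON format. The helper name `endpoints_by_alert` is hypothetical, and the sample payload reuses only the label values from the example in the table; its `value` field is a placeholder, not real data.

```python
import json
from collections import defaultdict

# Illustrative payload in the Prometheus instant-query response format.
# The "value" entry is a placeholder and is not used below.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "ALERTS_FOR_STATE",
          "alertname": "ipint_is_not_activated",
          "ep_name": "hosts/Server1/DeviceIpint.99",
          "instance": "127.0.0.1:20108",
          "job": "ngp_exporter",
          "ngp_alert": "true"
        },
        "value": [0, "0"]
      }
    ]
  }
}
"""


def endpoints_by_alert(payload: dict) -> dict:
    """Map each alertname to the list of ep_name values that report it."""
    if payload.get("status") != "success":
        raise ValueError("query did not succeed")
    grouped = defaultdict(list)
    for series in payload["data"]["result"]:
        labels = series["metric"]
        grouped[labels.get("alertname", "")].append(labels.get("ep_name", ""))
    return dict(grouped)


alerts = endpoints_by_alert(json.loads(SAMPLE_RESPONSE))
for alertname, endpoints in alerts.items():
    print(alertname, endpoints)
```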
sum by (process_id) (100 / scalar(wmi_cs_logical_processors) * (irate(wmi_process_cpu_time_total{process="AppHost"}[10m]))) or ngp_cpu_total_usage
sum by (process_id) (avg_over_time(wmi_process_working_set{process="AppHost"}[5m])) / 1024 or avg_over_time(wmi_os_virtual_memory_bytes[5m]) / 1024
100.0 - 100 * avg_over_time(wmi_os_virtual_memory_free_bytes[5m]) / avg_over_time(wmi_os_virtual_memory_bytes[5m])
sum by (groupname) (namedprocess_namegroup_memory_bytes{memtype="resident"})
100 - node_memory_MemAvailable_bytes * 100 / node_memory_MemTotal_bytes
sum by (object_id) (rate(namedprocess_namegroup_cpu_seconds_total{groupname="AppHost"}[1m])) * 100
100 * avg without (cpu) (1 - rate(node_cpu_seconds_total{mode="idle"}[1m]))
namedprocess_namegroup_memory_bytes{object_id=~"APP_HOST.*",memtype="proportionalResident"}
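
Several of the expressions above reduce to simple percentage arithmetic: memory usage is 100 minus the free (or available) share of the total, and CPU load is 100 times one minus the idle share. The following is a minimal sketch of that arithmetic with made-up input numbers; the function names are hypothetical and the values are not measurements.

```python
def memory_used_percent(available_bytes: float, total_bytes: float) -> float:
    """Mirror 100 - node_memory_MemAvailable_bytes * 100 / node_memory_MemTotal_bytes."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 - available_bytes * 100.0 / total_bytes


def cpu_busy_percent(idle_fraction: float) -> float:
    """Mirror 100 * (1 - rate(node_cpu_seconds_total{mode="idle"}[1m])).

    idle_fraction is the share of time spent idle, between 0 and 1.
    """
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    return 100.0 * (1.0 - idle_fraction)


GIB = 1024 ** 3

# Made-up inputs: 4 GiB available out of 16 GiB, and a CPU that is idle 75% of the time.
print(memory_used_percent(4 * GIB, 16 * GIB))  # 75.0
print(cpu_busy_percent(0.75))                  # 25.0
```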