General information

The web interface of the self-diagnostics service is available for monitoring the system status and analyzing its performance.

Using the service, you can:

  • View system metrics of the server.
  • Analyze resource usage (CPU, memory).
  • Monitor the state of cameras and detectors.
  • Monitor the archive state.
  • Run metric queries using the Prometheus Query Language (PromQL).

Metrics can be displayed:

  • As a table (current values).
  • As graphs for a selected time period.

Access to the self-diagnostics service

To go to the monitoring interface:

  1. Open the web browser.
  2. In the address bar, enter: http://127.0.0.1:20040/.
  3. Press Enter.

Interface and query execution

The interface allows you to run queries against metrics and analyze their values.

To run a query:

  1. Enter a metric in the Enter expression field.

    Note

    To view the metrics available in the Enter expression field, click the button → Explore metrics.

  2. If necessary, specify a time span.
  3. Click the Execute button.

Complex queries can be executed using PromQL.
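As a minimal sketch, a standard Prometheus function can be applied to one of the metrics exposed by the service (ngp_cpu_total_usage is listed among the main metrics on this page):

```promql
avg_over_time(ngp_cpu_total_usage[5m])
```

This returns the CPU load averaged over the last 5 minutes instead of a single instantaneous value.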

Basic query options

The basic options for executing queries are listed below:

  • Usage of several metrics. You can use several metrics in one query.
  • Filter by parameters. You can filter metrics by parameters (labels) using curly brackets. For example:

    ngp_fps{ep_name=~"hosts/TEST/DeviceIpint.2/SourceEndpoint.video:0:0"}

    In this case, FPS values are displayed only for the specified source.

  • Usage of logical and arithmetic operators to find anomalies. In queries, you can use:

      • Arithmetic operators.
      • Logical operators.
      • Prometheus functions.

    For example:

    ngp_fps < 17

    This query finds sources with a frame rate below 17 FPS. For a full list of logical and arithmetic operators, see the official Prometheus documentation.
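The filter and operator options can be combined in one expression; a sketch using the metrics from this page:

```promql
ngp_fps < 17 and ngp_fps > 0
```

The and operator keeps only sources whose frame rate is below 17 FPS but above zero, which excludes stopped sources from the anomaly list.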

View query results

You can view query results in two modes:

  • Table:
    • Displays current metric values as a table.
    • Updates when you change the time span.

  • Graph:
    • Allows you to plot a graph of metric changes over time.
    • You can set the time period for plotting a graph.
    • You can set the end point of a graph.
    • You can set the interval between data points.
    • You can also enable stacking mode, which fills the area under the graph (the Unstacked/Stacked options).

The main metrics of the service

Below are the main metrics available in the self-diagnostics service.

Metrics of system status:

  • ngp_cpu_total_usage—the CPU load of the server.

Archive metrics:

  • ngp_archive_channel_fps—the frame rate of all cameras when recording to the archive.
  • ngp_archive_volume_size—the current total size of the archive (in bytes).

Metrics of cameras and video analytics:

  • ngp_fps—the frame rate of all cameras, detectors, and decoders.
  • ngp_people_count—the last captured number of people in the frame by the Crowd estimation VA detector.
  • ngp_errors—the number of errors in the operation of detectors.
  • ngp_skipped_pp—the number of frames skipped by the Crowd estimation VA detector due to a lack of processing resources.

Metrics of system status:

  • ALERTS_FOR_STATE—found and fixed malfunctions. Contains the alertname parameter with the problem type.

    Example:

    ALERTS_FOR_STATE{alertname="ipint_is_not_activated",ep_name="hosts/Server1/DeviceIpint.99",instance="127.0.0.1:20108",job="ngp_exporter",ngp_alert="true"}

    The meaning of the alertname values for the ALERTS_FOR_STATE metric (see General information about the self-diagnostics service):

      • low_os_memory—out of RAM.
      • ipint_is_not_activated—the camera is connected, but there is no data from it.
      • no_samples_in_detector—no events from the detector.
      • restart_services_when_archive_source_not_activated—recording to the archive isn't working.
      • restart_services_when_no_samples_in_archive—recording to the archive with 0 FPS.
      • restart_services_when_no_ping_from_detector_to_archive—no recording to the archive on an event from the detector.
      • logs_disk_space_is_low / db_disk_space_is_low—out of system disk space.
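Alert results like the example above can also be inspected programmatically. A minimal Python sketch, assuming the service returns responses in the standard Prometheus instant-query JSON format (the sample below is hand-made from the example on this page, not real service output):

```python
import json

# Hand-made sample modeled on the ALERTS_FOR_STATE example above; the shape
# follows the standard Prometheus instant-query API, which this service is
# assumed (not confirmed by this page) to be compatible with.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "ALERTS_FOR_STATE",
          "alertname": "ipint_is_not_activated",
          "ep_name": "hosts/Server1/DeviceIpint.99",
          "ngp_alert": "true"
        },
        "value": [1700000000, "1"]
      }
    ]
  }
}
"""

def extract_alerts(response_text: str) -> list[tuple[str, str]]:
    """Return (alertname, ep_name) pairs from an instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        return []
    return [
        (sample["metric"].get("alertname", ""), sample["metric"].get("ep_name", ""))
        for sample in payload["data"]["result"]
    ]

alerts = extract_alerts(SAMPLE_RESPONSE)
```

Each returned pair names the problem type and the affected device, which is usually enough to route a notification.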

Metrics of disk status (SMART):

  • smartctl_device_smart_status—the general disk status. The main metric values:

      • 1—the disk is in an operational state.
      • 0—the disk has failed, or is predicted to fail within the next 24 hours.

    In such cases, we recommend checking:

      • The logs of the metric exporter.
      • Access permissions to devices.
      • The correct operation of smartctl.

  • smartctl_device_attribute—contains detailed SMART attributes of disks. There are several value types:

      • raw—the actual attribute value without interpretation.
      • thresh—the threshold for the normalized value. If the normalized value falls to or below thresh, the attribute is considered failing, which indicates a potential device failure.
      • value—the current normalized value of the attribute. It typically ranges from 1 to 100 or from 1 to 253 and represents the device status in a convenient form.
      • worst—the worst normalized value recorded during the device's operation. Used to analyze disk status degradation.

    Example of interpretation. When the smartctl_device_attribute metric is analyzed, the attribute values can look like this:

      • raw: 15 (the actual count of reallocated sectors).
      • thresh: 50 (the threshold at which the disk is considered unreliable).
      • value: 55 (the current normalized attribute status).
      • worst: 50 (the worst recorded attribute status).

    Usage in monitoring:

      • raw: used for detailed analysis and diagnostics.
      • thresh: critical for setting up warnings.
      • value and worst: used to monitor the device status.
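The value-versus-thresh comparison above can be sketched as a small check (a hypothetical helper for illustration, not part of the service):

```python
def smart_attribute_healthy(value: int, thresh: int) -> bool:
    """A SMART attribute is considered failing when its normalized
    value falls to or below the vendor threshold."""
    return value > thresh

# Numbers taken from the interpretation example above:
current_ok = smart_attribute_healthy(value=55, thresh=50)  # current value is still above the threshold
worst_ok = smart_attribute_healthy(value=50, thresh=50)    # the worst value touched the threshold
```

With these numbers, the disk currently passes, but the worst recorded value shows it has already touched the threshold at some point, which is worth a warning.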

Examples of useful queries for Windows OS

  1. The CPU load graph (analog of the System monitor):
    sum by (process_id) (100 / scalar(wmi_cs_logical_processors) * (irate(wmi_process_cpu_time_total{process="AppHost"}[10m]))) or ngp_cpu_total_usage
  2. RAM usage by the AppHost processes and the total memory size:
    sum by (process_id) (avg_over_time(wmi_process_working_set{process="AppHost"}[5m])) / 1024 or avg_over_time(wmi_os_virtual_memory_bytes[5m]) / 1024
  3. The percentage of RAM usage:
    100.0 - 100 * avg_over_time(wmi_os_virtual_memory_free_bytes[5m]) / avg_over_time(wmi_os_virtual_memory_bytes[5m])

Examples of useful queries for Linux OS

  1. The total RAM usage by the AppHost processes:
    sum by (groupname) (namedprocess_namegroup_memory_bytes{memtype="resident"})
  2. The percentage of RAM usage:
    100 - node_memory_MemAvailable_bytes * 100 / node_memory_MemTotal_bytes
  3. The CPU load by the AppHost processes as a percentage:
    sum by (object_id) (rate(namedprocess_namegroup_cpu_seconds_total{groupname="AppHost"}[1m])) * 100
  4. The total CPU load as a percentage:
    100 * avg without (cpu) (1 - rate(node_cpu_seconds_total{mode="idle"}[1m]))
  5. RAM usage by the AppHost processes to detect memory leaks:
    namedprocess_namegroup_memory_bytes{object_id=~"APP_HOST.*",memtype="proportionalResident"}
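Such queries can also be sent to the service over HTTP. A minimal Python sketch that builds the request URL, assuming the service exposes the standard Prometheus instant-query path /api/v1/query (an assumption, not confirmed by this page):

```python
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:20040"  # the address from the access section above
# Assumed Prometheus-compatible instant-query path; verify it against your installation.
QUERY_PATH = "/api/v1/query"

def build_query_url(promql: str) -> str:
    """Percent-encode a PromQL expression into an instant-query URL."""
    return f"{BASE_URL}{QUERY_PATH}?{urlencode({'query': promql})}"

url = build_query_url("ngp_fps < 17")
```

The encoded URL can then be fetched with any HTTP client; the response arrives in the Prometheus JSON format.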