Performance Metrics
Overview
Engine can be configured to record and report on a number of metrics that cover various aspects of its performance. By default, this feature is disabled, but you can enable it by adding a setting called EnableRecordingMetrics
to your Engine config file with a value of true
. Once this setting is enabled, you can analyze those metrics by querying /RusticiEngine/metrics
. The information is reported using a Prometheus exporter, so we recommend that you analyze this data using a Prometheus server or a similar tool. For more information, you can refer to the Prometheus website.
Preface: Histogram Timers
Most of our metrics are recorded using histogram-backed timers. This means that each timer has multiple different labels, and each label identifies the bucket of the histogram being described. A single timer will show up on multiple lines of the metric report. Let's use the timer that monitors the response time of every single HTTP request that Engine receives as our example:
http_all_requests_duration_seconds_bucket{le="0.01"} 20
http_all_requests_duration_seconds_bucket{le="0.05"} 42
http_all_requests_duration_seconds_bucket{le="0.1"} 50
...etc...
http_all_requests_duration_seconds_bucket{le="60"} 56
http_all_requests_duration_seconds_bucket{le="+inf"} 56
http_all_requests_duration_seconds_sum 10.37343612
http_all_requests_duration_seconds_count 219
Metrics ending in _bucket
are used to categorize the various response times of the API. The le
label is short for "less than or equal to", so a _bucket
metric with a label of le="0.25"
specifies the number of requests that Engine responded to in less than 0.25 seconds. These buckets are cumulative, so if a request lands in the le="0.25"
bucket, then it will also contribute to the le="0.5"
and le="1"
buckets, all the way up to le="+inf"
. The _sum
and _count
metrics measure the total amount of time spent responding to given endpoint and the number of requests Engine has received to that endpoint.
Course Runtime
To see the response times of the endpoints used by the player, you can review course_runtime_operations_duration_seconds
. Along with the usual le
label, this metric also uses a label called operation
, which has the following possible values:
launch_page
- Used when launching a registration. All learning standards will use this page.player_configuration
- This is the endpoint called by the player after being launched. This is used by all learning standards, except for Tin Can courses when the settingUsePlayerForTinCanLaunches
is set to false (the default value in Engine 2018+ is true).record_results
- Called to record a registration's state during the course of a SCORM course.record_results_on_player_exit
- Called when a learner exits a SCORM course.process_aicc_request
- Used to perform various operations during AICC coursescmi5_au_launch
- Only used when launching cmi5 courseslti_launch
- Only used when launching LTI linkspersist_runtime_data
- The saving of course runtime data to the database
Here is an example:
course_runtime_operations_duration_seconds_bucket{le="0.01", operation="player_configuration"} 128
The API
We keep recording metrics covering the requests to both versions of the API. The method
label describes the HTTP method used to request the URL, and the url
label contains the URL that was requested with any identifiers replaced with [id]
. Aside from the name used for the metric and the possible values of the url
label, versions 1 and 2 of the API are presented identically. They are formatted like this:
api_v2_requests_duration_seconds_bucket{method="get", le="0.01", url="/registrations/[id]"} 28
API Counters
In addition to the above histograms, Engine also records counters to keep track of the number of requests made to/by each tenant and the status codes used in its responses to API requests:
api_requests_total{tenant="default"} 3
api_responses_total{status_code="200"} 328
API Import Course Jobs
Every time a course is imported, we track how long it takes to do so under the api_import_course_jobs
name. This metric has a label titled 'type', which designates the type of import as uploaded_zip
, ad_hoc_import
, import_by_manifest
, file_path
, or zip_url
. An example metric for the 10-second bucket for ad-hoc imports follows:
api_import_course_jobs_duration_seconds_bucket{le="10", type="ad_hoc_import"} 1
Webhooks
In the past, Engine only officially supported postback updates for two things: registration rollups and import job status. In Engine 22, we introduced our new subscription system, allowing you to subscribe to Engine event notifications and receive postbacks to a URL of your choosing.
Although we strongly advise you use the new subscription system, Engine still is capable of using the legacy system. As such, the following metrics are used for both the new subscription system and the legacy postback system:
api_postback_successful_responses
:postbackType
label with Queue ID as stringapi_postback_failed_responses
:postbackType
label with Queue ID as stringapi_postback_queue_size
:type
label with Queue ID as stringapi_postback_queue_items
:type
label with Queue ID as string
Example:
api_postback_queue_size{type="rollup_reg"} 0
api_postback_queue_items{type="rollup_reg"} 0
Modern Webhooks Metrics
sub_postback
:type
label with potential valuesend_message
Example:
sub_postback_duration_seconds_bucket{le="30", type="send_message"} 1
Legacy Postback Metrics
api_postback
:operation
label with potential valuesimport_results
andreg_rollup
Example:
api_postback{le="30", operation="import_results"} 1
The LRS
The performance of Engine's LRS can also be monitored. The metrics reporter uses the method
label to differentiate the different xAPI resources being requested. For a detailed description of each of these endpoints, please refer to the xAPI spec. The values currently used for the method
label are:
activities
activities_profile
activities_state
actors
actor_profile
extended_actions
statements
Example:
xapi_requests_duration_seconds_bucket{method="activities", le="0.01"} 17
xAPI Requests by Version
xAPI requests are tracked by version and sorted into three seperate buckets (listed below). These metrics include a method
label for what HTTP method was used for the request and a path
label to specify what xAPI URL path was hit.
xapi__v0_9_requests
xapi__v0_95_requests
xapi__v1_0_x_requests
Here's what the metric for accessing GET activities/state with xAPI 1.0 or above might look like:
xapi__v1_0_x_requests{path="activities/state", method="get"} 1
Statement Pipe Metrics
We monitor each xAPI statement pipe individually. You can see the IDs of your statements pipes using the /xapi/pipes resource. You can then monitor a specific pipe by matching its ID to the value of the pipe_id
label:
xapi_statement_pipe_processing_duration_seconds_bucket{le="0.01", pipe_id="0374c238-e5d8-4361-97e7-e5612e1acf18"} 0
xAPI Metadata Fetching
Every time Engine fetches metadata for an activity, it's logged under the xapi_metadata_fetches_total
meter.
This meter includes a label titled fetch_type
with possible values: success_registry
, success_other
, tried_registry
, tried_other
, objStoreCached_registry
, objStoreCached_other
.
Example:
xapi_metadata_fetches_total{fetch_type="success_registry"} 9
Background Processors
Engine can have several processes running in the background performing different tasks, like processing event messages or forwarding xAPI statements. The list below contains a complete record of the metric names associated with our various processors.
registration_postback_queue_processor
import_postback_queue_processor
x_api_statement_pipe_processor
scorm_to_tin_can_processor
x_api_token_deletion_processor
pens_async_processor
event_message_processor
notification_sequencing_deletion_processor
launch_history_expiration_processor
Example:
registration_postback_queue_processor_duration_seconds_bucket{le="0.05", status="running"} 66
Note: Alongside the typical _duration_seconds_bucket
, _duration_seconds_count
, and _duration_seconds_sum
suffixes for metrics, Background Processor metrics also track the completion and failure of tasks with the _successful
and _failure
suffixes.
Example:
registration_postback_queue_processor_success{status="completed"} 1
The Database
Engine keeps timers describing how long it takes to execute its various database operations. These timers follow our usual histogram timer reporting patterns from above, and they utilize a method
label. Internally, Engine separates its operations into queries (pulling data out of the database) and non-queries (inserting new or manipulating existing data). These are the two possible values for the method
label on our database metrics, and they can be used to help diagnose the source of any issues with database speed that you might be having.
database_data_helper_operations_duration_seconds_bucket{method="query", le="0.01"} 221
In addition to monitoring its own database operations, Engine can also keep track of certain JDBC operations. Right now, we only record the time it takes to acquire a connection, but more JDBC operations may be added in the future.
database_jdbc_operations_duration_seconds_bucket{le="0.01", operation="acquire_connection"} 5042
JVM Metrics
The Java version of Engine can also report on various JVM statistics. There are various metrics recorded within a few category that each have their own predictable prefix:
- Metrics describing garbage collection begin with
jvm_gc_
- Memory usage statistics are denoted by
jvm_memory_
jvm_threads_
denotes information about the JVM's thread usage