Performance Metrics
Overview
Engine can be configured to record and report on a number of metrics that cover various aspects of its performance. By default, this feature is disabled, but you can enable it by adding a setting called EnableRecordingMetrics to your Engine config file with a value of true. Once this setting is enabled, you can analyze those metrics by querying /RusticiEngine/metrics. The information is reported using a Prometheus exporter, so we recommend that you analyze this data using a Prometheus server or a similar tool. For more information, you can refer to the Prometheus website.
Preface: Histogram Timers
Most of our metrics are recorded using histogram-backed timers. This means that each timer has multiple different labels, and each label identifies the bucket of the histogram being described. A single timer will show up on multiple lines of the metric report. Let's use the timer that monitors the response time of every single HTTP request that Engine receives as our example:
http_all_requests_duration_seconds_bucket{le="0.01"} 20
http_all_requests_duration_seconds_bucket{le="0.05"} 42
http_all_requests_duration_seconds_bucket{le="0.1"} 50
...etc...
http_all_requests_duration_seconds_bucket{le="60"} 56
http_all_requests_duration_seconds_bucket{le="+inf"} 56
http_all_requests_duration_seconds_sum 10.37343612
http_all_requests_duration_seconds_count 219
Metrics ending in _bucket are used to categorize the various response times of the API. The le label is short for "less than or equal to", so a _bucket metric with a label of le="0.25" specifies the number of requests that Engine responded to in less than 0.25 seconds. These buckets are cumulative, so if a request lands in the le="0.25" bucket, then it will also contribute to the le="0.5" and le="1" buckets, all the way up to le="+inf". The _sum and _count metrics measure the total amount of time spent responding to given endpoint and the number of requests Engine has received to that endpoint.
Course Runtime
To see the response times of the endpoints used by the player, you can review course_runtime_operations_duration_seconds. Along with the usual le label, this metric also uses a label called operation, which has the following possible values:
launch_page- Used when launching a registration. All learning standards will use this page.player_configuration- This is the endpoint called by the player after being launched. This is used by all learning standards, except for Tin Can courses when the settingUsePlayerForTinCanLaunchesis set to false (the default value in Engine 2018+ is true).record_results- Called to record a registration's state during the course of a SCORM course.record_results_on_player_exit- Called when a learner exits a SCORM course.process_aicc_request- Used to perform various operations during AICC coursescmi5_au_launch- Only used when launching cmi5 courseslti_launch- Only used when launching LTI linkspersist_runtime_data- The saving of course runtime data to the database
Here is an example:
course_runtime_operations_duration_seconds_bucket{le="0.01", operation="player_configuration"} 128
The API
We keep recording metrics covering the requests to both versions of the API. The method label describes the HTTP method used to request the URL, and the url label contains the URL that was requested with any identifiers replaced with [id]. Aside from the name used for the metric and the possible values of the url label, versions 1 and 2 of the API are presented identically. They are formatted like this:
api_v2_requests_duration_seconds_bucket{method="get", le="0.01", url="/registrations/[id]"} 28
API Counters
In addition to the above histograms, Engine also records counters to keep track of the number of requests made to/by each tenant and the status codes used in its responses to API requests:
api_requests_total{tenant="default"} 3
api_responses_total{status_code="200"} 328
API Import Course Jobs
Every time a course is imported, we track how long it takes to do so under the api_import_course_jobs name. This metric has a label titled 'type', which designates the type of import as uploaded_zip, ad_hoc_import, import_by_manifest, file_path, or zip_url. An example metric for the 10-second bucket for ad-hoc imports follows:
api_import_course_jobs_duration_seconds_bucket{le="10", type="ad_hoc_import"} 1
Webhooks
In the past, Engine only officially supported postback updates for two things: registration rollups and import job status. In Engine 22, we introduced our new subscription system, allowing you to subscribe to Engine event notifications and receive postbacks to a URL of your choosing.
Although we strongly advise you use the new subscription system, Engine still is capable of using the legacy system. As such, the following metrics are used for both the new subscription system and the legacy postback system:
api_postback_successful_responses:postbackTypelabel with Queue ID as stringapi_postback_failed_responses:postbackTypelabel with Queue ID as stringapi_postback_queue_size:typelabel with Queue ID as stringapi_postback_queue_items:typelabel with Queue ID as string
Example:
api_postback_queue_size{type="rollup_reg"} 0
api_postback_queue_items{type="rollup_reg"} 0
Modern Webhooks Metrics
sub_postback:typelabel with potential valuesend_message
Example:
sub_postback_duration_seconds_bucket{le="30", type="send_message"} 1
Legacy Postback Metrics
api_postback:operationlabel with potential valuesimport_resultsandreg_rollup
Example:
api_postback{le="30", operation="import_results"} 1
The LRS
The performance of Engine's LRS can also be monitored. The metrics reporter uses the method label to differentiate the different xAPI resources being requested. For a detailed description of each of these endpoints, please refer to the xAPI spec. The values currently used for the method label are:
activitiesactivities_profileactivities_stateactorsactor_profileextended_actionsstatements
Example:
xapi_requests_duration_seconds_bucket{method="activities", le="0.01"} 17
xAPI Requests by Version
xAPI requests are tracked by version and sorted into three seperate buckets (listed below). These metrics include a method label for what HTTP method was used for the request and a path label to specify what xAPI URL path was hit.
xapi__v0_9_requestsxapi__v0_95_requestsxapi__v1_0_x_requests
Here's what the metric for accessing GET activities/state with xAPI 1.0 or above might look like:
xapi__v1_0_x_requests{path="activities/state", method="get"} 1
Statement Pipe Metrics
We monitor each xAPI statement pipe individually. You can see the IDs of your statements pipes using the /xapi/pipes resource. You can then monitor a specific pipe by matching its ID to the value of the pipe_id label:
xapi_statement_pipe_processing_duration_seconds_bucket{le="0.01", pipe_id="0374c238-e5d8-4361-97e7-e5612e1acf18"} 0
xAPI Metadata Fetching
Every time Engine fetches metadata for an activity, it's logged under the xapi_metadata_fetches_total meter.
This meter includes a label titled fetch_type with possible values: success_registry, success_other, tried_registry, tried_other, objStoreCached_registry, objStoreCached_other.
Example:
xapi_metadata_fetches_total{fetch_type="success_registry"} 9
Background Processors
Engine can have several processes running in the background performing different tasks, like processing event messages or forwarding xAPI statements. The list below contains a complete record of the metric names associated with our various processors.
registration_postback_queue_processorimport_postback_queue_processorx_api_statement_pipe_processorscorm_to_tin_can_processorx_api_token_deletion_processorpens_async_processorevent_message_processornotification_sequencing_deletion_processorlaunch_history_expiration_processor
Example:
registration_postback_queue_processor_duration_seconds_bucket{le="0.05", status="running"} 66
Note: Alongside the typical _duration_seconds_bucket, _duration_seconds_count, and _duration_seconds_sum suffixes for metrics, Background Processor metrics also track the completion and failure of tasks with the _successful and _failure suffixes.
Example:
registration_postback_queue_processor_success{status="completed"} 1
The Database
Engine keeps timers describing how long it takes to execute its various database operations. These timers follow our usual histogram timer reporting patterns from above, and they utilize a method label. Internally, Engine separates its operations into queries (pulling data out of the database) and non-queries (inserting new or manipulating existing data). These are the two possible values for the method label on our database metrics, and they can be used to help diagnose the source of any issues with database speed that you might be having.
database_data_helper_operations_duration_seconds_bucket{method="query", le="0.01"} 221
In addition to monitoring its own database operations, Engine can also keep track of certain JDBC operations. Right now, we only record the time it takes to acquire a connection, but more JDBC operations may be added in the future.
database_jdbc_operations_duration_seconds_bucket{le="0.01", operation="acquire_connection"} 5042
JVM Metrics
The Java version of Engine can also report on various JVM statistics. There are various metrics recorded within a few category that each have their own predictable prefix:
- Metrics describing garbage collection begin with
jvm_gc_ - Memory usage statistics are denoted by
jvm_memory_ jvm_threads_denotes information about the JVM's thread usage