Troubleshooting and diagnostics

Info: If you have Apama Smart Rules-only (also called Apama-ctrl-smartrules), the functionality described in this topic does not apply.

Downloading diagnostics and logs

If a user has READ permission for “CEP management”, then two links for downloading diagnostics information are available from the bottom of the Streaming Analytics application: one for downloading basic diagnostics information (the Diagnostics link) and another one for downloading enhanced (more resource-intensive) diagnostics information (the Enhanced link). These links are shown at the bottom of the home screen and also on the pages that appear when you go to the Analytics Builder and EPL Apps pages (that is, in the EPL app manager and in the model manager).

It may be useful to capture this diagnostics information when experiencing problems, or for debugging EPL apps. It is also useful to provide to product support if you are filing a support ticket. You can see a version number next to the links.

Basic diagnostics information is provided in a ZIP file named diagnostic-overview<timestamp>.zip and includes the following information (this should be typically a few Megabytes, and be generated in about 5 seconds):

Enhanced diagnostics information is provided in a ZIP file named diagnostic-enhanced<timestamp>.zip and includes the following information:

What a user can see or do depends on the permissions:

Log files of the Apama-ctrl microservice

There are two ways to get the logs of the Apama-ctrl microservice:

Contact product support if needed.

Diagnostics REST endpoints

The following diagnostics endpoints are available for REST requests. These require authentication as a user with READ permission for “CEP management”:

Alarms generated by the Apama-ctrl microservice

Alarms are created by user applications in the Cumulocity IoT tenant (for example, by an analytic model, an activated EPL file, or a smart rule). To learn about alarms in general, refer to Device Management > Working with alarms in the User guide. The Apama-ctrl microservice also generates alarms because it has encountered some problem, so that the user is notified about the situation. The information below is about alarms that are generated by the Apama-ctrl microservice, their causes, consequences and possible ways to resolve them.

Info: Alarms generated by Apama-ctrl about its own state are available as of Analytics Builder 10.5.0 and EPL Apps 10.5.0.

You can view alarms in the following ways:

  1. In the Cockpit application. See Cockpit in the User guide for detailed information.
  2. In the Administration application, under Ecosystem > Microservices. Click the Apama-ctrl microservice and then click Status. See Administration > Managing and monitoring microservices in the User guide for detailed information.
  3. From the Streaming Analytics application. Click the Diagnostics (or Enhanced) link which is provided at the bottom of the home screen. A ZIP file is then downloaded that contains alarms information under /alarm/alarms_apama-ctrl-object.json. See Downloading diagnostics and logs for detailed information.

Alarm severities

Severity Description
CRITICAL Apama-ctrl was unable to continue running the user’s applications and will require corrective action.
MAJOR Apama-ctrl has encountered a situation that will result in some loss of service (for example, due to a restart).
MINOR Apama-ctrl has a problem that you might want to fix.
WARNING There is a warning.

Alarms created by the Apama-ctrl microservice

Apama-ctrl can create alarms to notify users in scenarios such as the correlator running out of memory, uncaught exceptions in activated EPL files, and so on. Once you see an alarm in the Cumulocity IoT tenant, you should diagnose it and resolve it depending on the severity level of the raised alarm. Each alarm has details such as title, text, type, date, and count (represents the number of times the alarm has been raised).

The following is a list of the alarms. The information further down below explains when these alarms will occur, their consequences, and how to resolve them.

Once the cause of an alarm is resolved, you must acknowledge and clear the alarm in the Cumulocity IoT tenant. Otherwise, you will continue to see the alarm until a further restart of the Apama-ctrl microservice.

Info: The alarm texts for the alarms below may undergo minor changes in the future.

Change in tenant options and restart of Apama-ctrl

This alarm is raised when a tenant option changes in the analytics.builder or streaminganalytics category. For details on the tenant options, refer to the Tenant API in the Cumulocity IoT OpenAPI Specification for more details.

Analytics Builder allows you to configure its settings by changing the tenant options, using key names such as numWorkerThreads or status_device_name. For example, if you want to process things in parallel, you can set numWorkerThreads to 3 by sending a REST request to Cumulocity IoT, which will update the tenant option. Such a change automatically restarts the Apama-ctrl microservice. To notify the users about the restart, Apama-ctrl raises an alarm, saying that Apama has detected a change in a tenant option and will restart in order to use it.

Once you see this alarm, you can be sure that your change is effective.

Safe mode on startup

This alarm is raised whenever the Apama-ctrl microservice switches to safe mode.

Apama detects if it has been repeatedly restarting. If it looks like this has been caused by any kind of user asset (EPL, analytic models, extensions), Apama disables them all as a precaution. Potential causes are, for example, an EPL app that consumes more memory than is available or an extension containing bugs.

You can check the mode of the microservice (either normal or safe mode) by making a REST request to service/cep/diagnostics/apamaCtrlStatus (available as of EPL Apps 10.5.7 and Analytics Builder 10.5.7), which contains a safe_mode flag in its response.

To diagnose the cause of an unexpected restart, you can try the following:

In safe mode, all previously active analytic models and EPL apps are deactivated and must be manually re-activated.

Deactivating models in Apama Starter

This alarm is raised when Apama-ctrl switches from the fully capable microservice to Apama Starter with more than 3 active models.

In Apama Starter, a user can have a maximum of 3 active models. For example, a user is working with the fully capable Apama-ctrl microservice and has 5 active models, and then switches to Apama Starter. Since Apama Starter does not allow more than 3 active models, it deactivates all the active models (5) and raises an alarm to notify the user.

High memory usage

This alarm is raised whenever the correlator consumes 90% of the maximum memory permitted for the microservice container. During this time, the Apama-ctrl microservice automatically generates the diagnostics overview ZIP file which contains diagnostics information used for identifying the most likely cause for memory consumption.

There are 3 variants of this alarm, depending on the time and count restrictions of the generated diagnostics overview ZIP file.

First variant:

Second variant:

Third variant:

Running EPL apps (and to a lesser extent, smart rules and analytic models) consumes memory, the amount will depend a lot on the nature of the app running. The memory usage should be approximately constant for a given set of apps, but it is possible to create a “memory leak”, particularly in an EPL file or a custom block. The Apama-ctrl microservice monitors memory and raises an alarm with WARNING severity if the 90% memory limit is reached along with the diagnostics overview ZIP file and saves it to the files repository (as mentioned in the alarm text).

Apama-ctrl generates the diagnostics overview ZIP files with the following conditions:

To diagnose high-memory-consuming models and EPL apps, you can try the following (it could be listener leaks, excessive state being stored or spawned monitors leaking, and so on):

If the memory continues to grow, then when it reaches the limit, the correlator will run out of memory and Apama-ctrl will shut down. To prevent the microservice from going down, you must fix this as a priority.

See also Diagnostic tools for Apama in Cumulocity IoT in Software AG’s Tech Community.

Warning or higher level logging from an EPL file

This alarm is raised whenever messages are logged by Apama EPL files with specific log levels (including CRITICAL, FATAL, ERROR and WARNING).

The Streaming Analytics application allows you to deploy EPL files to the correlator. The Apama-ctrl microservice analyzes logged content in the EPL files and raises an alarm for specific log levels with details such as monitor name, log text and alarm type (either of WARNING or MAJOR), based on the log level.

For example, the following is a simple monitor which prints a sequence and logs some texts at different EPL log levels.

monitor Sample{
   action onload() {
      log "Info"; // default log level is now INFO
      log "Fatal Error" at FATAL; // log level is FATAL
      log "Critical Error" at CRIT; // log level is CRITICAL
      log "Warning" at WARN; // log level is WARNING
   }
}

Apama-ctrl analyzes all the log messages, filters out only certain log messages, and raises an alarm for the identified ones. Thus, Apama-ctrl generates the following three alarms for the above example:

First alarm:

Second alarm:

Third alarm:

An EPL file throws an uncaught exception

You have seen that the Apama-ctrl microservice raises alarms for logged messages. In addition, there can also be uncaught exceptions (during runtime). Apama-ctrl identifies such exceptions and raises alarms so that you can identify and fix the problem.

For example, the following monitor throws IndexOutOfBoundsException during runtime:

monitor Sample{
   sequence<string> values := ["10", "20", "30"];
   action onload() {
      // IndexOutOfBoundsException (runtime error)
      log "Value = " + values[10] at ERROR;
   }
}

Apama-ctrl generates the following alarm for the above example:

You can diagnose the issue by the monitor name and line number given in the alarm.

For more details, you can also check the Apama logs if the tenant has the “microservice hosting” feature enabled. Alarms of this type should be fixed as a priority as these uncaught exceptions will terminate the execution of that monitor instance, which will typically mean that your app is not going to function correctly. This might even lead to a correlator crash if not handled properly.

An EPL file blocks the correlator context for too long

If an EPL app has an infinite loop, it may block the correlator context for too long, not letting any other apps run in the same context or, even worse, causes excessive memory usage (as the correlator is unable to perform any garbage collection cycles) leading to the app running out of memory. The Apama-ctrl microservice identifies such scenarios (the correlator logs warning messages if an app is blocking a context for too long) and raises alarms, so that the user can identify and fix the problem.

For example, the following monitor blocks the correlator main context:

event MyEvent {
}

monitor Sample{
    action onload() {
        while true {
            // do something
            send MyEvent() to "foo";
        }
    }
}

Apama-ctrl generates the following alarm for the above example:

You can diagnose the issue by the monitor name and context name given in the alarm.

For more details, you can also check the Apama logs if the tenant has the “microservice hosting” feature enabled. Alarms of this type should be fixed as a priority as these scenarios may lead to the microservice and correlator running out of memory.

EPL app restore timeout on restart of Apama-ctrl

If restoring an EPL app on a restart of the Apama-ctrl microservice takes a long time and exceeds the time limit specified by the recovery.timeoutSecs tenant option (in the streaminganalytics category) or a default of 60 seconds, the Apama-ctrl microservice times out and raises an alarm, indicating that it will restart and reattempt to restore the EPL app. The alarm text includes the names of any EPL apps that are considered to be the reason for the timeout.

The following information is only included in the alarm text if the Apama-ctrl microservice detects that the timeout is due to some EPL apps: “The following EPL apps may be the cause of this: <comma-separated list of app names>.”. If no such apps are detected, this information is omitted from the alarm text.

Multiple extensions with the same name

This alarm is raised when the Apama-ctrl microservice tries to activate the deployed extensions during its startup process and there are multiple extensions with the same name.

This disables all extensions that were deployed to Apama-ctrl. In order to use the deployed extensions, the user has to decide which extensions to keep and then delete the duplicate ones.

Info: In case of multiple duplicates, this alarm is only listed once.

Smart rule configuration failed

This alarm is raised if a smart rule contains an invalid configuration.

To diagnose the cause, download the diagnostics overview ZIP file as described in Downloading diagnostics and logs. Or, if that fails, log on as an administrator and look at the result of a GET request to /service/smartrule/smartrules?withPrivateRules=true. Review the smart rules JSON and look for invalid smart rule configurations. Such smart rules need to be corrected.

The Apama microservice log contains more details on the reason for the smart rule configuration failure. For example, it is invalid to configure an “On measurement threshold create alarm” smart rule with a data point that does not exist.

Smart rule restore failed

This alarm is raised if a corrupt smart rule is present in the inventory and the correlator therefore fails to recover it correctly during startup.

To diagnose the cause, download the diagnostics overview ZIP file as described in Downloading diagnostics and logs. Or, if that fails, log on as an administrator and look at the result of a GET request to /service/smartrule/smartrules?withPrivateRules=true. Review the smart rules JSON and look for invalid smart rule configurations. Such smart rules may need to be deleted or corrected.

Connection to correlator lost

This alarm is raised in certain cases when the connection between the Apama-ctrl microservice and the correlator is lost. This should not happen, but can be triggered by high load situations.

Apama-ctrl will automatically restart. Report this to product support if this is happening frequently.

The correlator queue is full

This alarm is raised whenever the correlator queue is full, including both input and output queues.

The correlator’s input and output queues are periodically monitored to check for building up of events. If the pending queue size grows above the normal threshold (20,000 for the input queue and 10,000 for the output queue), an alarm is raised. The alarm text contains a snapshot of the correlator status at the time of raising the alarm. A correlator with a full input or output queue can cause a serious performance degradation.

The correlator queue size is based on the number of events, not raw bytes.

Check the alarm text to get an indication of which queue is blocking. This also contains information about the slowest receiver and the most backed-up context. To diagnose the cause, see the information given in The CEP queue is full. A problem is likely to trigger the “correlator queue is full” alarm followed by the “CEP queue is full” alarm.

The CEP queue is full

This alarm is raised whenever the CEP queue for the respective tenant is full.

Karaf nodes that send events to the CEP engine maintain per-tenant queues for the incoming events. This data gets processed by the CEP engine for the hosted CEP rules. For various reasons, these queues can become full and cannot accommodate newly arriving data. In such cases, an alarm is sent to the platform so that the end users are notified about the situation.

If the CEP queue is full, older events are removed to handle new incoming events. To avoid this, you must diagnose the cause of the queue being full and resolve it as soon as possible.

The CEP queue size is based on the number of CEP events, not raw bytes.

To diagnose the cause, you can try the following. It may be that the Apama-ctrl microservice is running slow because of time-consuming rules in the script, or the microservice is deprived of resources, or code is not optimized, and so on. Check the input and output queues from the “correlator queue is full” alarm (or from the microservice logs or from the diagnostics overview ZIP file under /correlator/status.json).