Operating Cumulocity DataHub

This section describes how you can access system information, usage statistics, and audit logs.

Checking system information

Requirements
You need administration permissions to access system information. See Defining Cumulocity DataHub permissions and roles for details.

In the navigator, select Administration and then System status to get information about the system configuration and its status.

Under Microservice you will find the status of the microservice, which is either marked as green or red. This status reflects whether the microservice can be accessed from the web application. If the microservice is accessible, its current version is shown. If not, check the status of the microservice and its logs as described in Managing applications.

Under Web application you will find the version of the web application.

Under Dremio you will find the status of Dremio, which is either marked as green or red. This status reflects whether Dremio can be accessed from the microservice. If Dremio is accessible, its current version is shown. If not, check the status of the microservice and its logs as described in Managing applications.

Under Management you will find the setup of the system. If you expand that box by clicking on the arrow to the right, all relevant system properties and their values are listed. Note that these values cannot be modified for a running microservice. The tenant administrator must redeploy the microservice with corresponding new values.

Tracking usage statistics

If enabled, Cumulocity DataHub tracks usage statistics on the amount of data being processed. These statistics are collected for offloading queries and track the amount of data these queries read from the Operational Store of Cumulocity. They are also collected for ad-hoc queries and track the amount of data these queries read from the data lake. The usage statistics can be used for volume-based charging. They can also be used to pinpoint resource-intensive queries in terms of network load.

Info
The tracking of usage statistics is supported for the Cumulocity DataHub Cloud edition. It is not supported for the Cumulocity DataHub Edge edition.

In the navigator, select Administration and then Usage statistics to view the usage statistics.

In the action bar, a date control allows you to select the month for which you want to see the usage statistics.

The three top panels show overall summary statistics as well as statistics separated for offloading and ad-hoc queries. If data from the month before the selected month is available, a tendency arrow illustrates whether the data volume of the selected month has decreased, increased, or stayed flat. The panels with the offloading and the ad-hoc query statistics additionally list the days with minimum/maximum volume as well as the daily average volume.

The table below the summary statistics shows the details on a per-day basis for the selected month. For each day, the volume offloaded and the volume queried are shown, as well as their sum, which constitutes the daily volume. In addition, the percentage of the monthly volume is shown, that is, how much the daily volume contributed to the overall monthly volume. The date of each entry links to the Query log, which lists all queries for the respective day.

Info
The statistics are refreshed once per hour. Therefore, the statistics for the current month may not include the latest data. The statistics are deleted after a retention period, so for older months statistics may no longer be available.

Viewing audit logs

Auditing comprises the query log, which shows the queries that have been executed, and the system log, which shows the operations that users have carried out.

Query log

In the navigator, select Auditing and then Query log to view the query log.

Requirements
The Cumulocity DataHub feature for storing query profiles must be enabled. The profiles are deleted after a retention period, so for older months profiles may no longer be available.

At the top of the page, you can select either offloading or ad-hoc queries, define a text filter on the offloading task/ad-hoc query string, and select a time period. Use the pagination buttons at the bottom of the page to navigate through the result list.

For each offloading query, the following information is provided:

Column name | Description
Offloading task | The task name of the offloading pipeline, complemented by a status icon showing success or failure of the pipeline execution
Runtime | The execution runtime of the Dremio queries related to the offloading run
Data scanned (MB) | The amount of data the offloading query has read from the Operational Store of Cumulocity
Data billed (MB) | The amount of data being billed (depending also on your contract); amounts of data less than 10 MB in an offloading query will be billed as if they were 10 MB
Details | The internal task UUID in an expandable box

For each ad-hoc query, the following information is provided:

Column name | Description
User | The username of the Dremio user that was used to execute the query
Query | The SQL query, complemented by a status icon showing success or failure of the query execution
Runtime | The execution runtime of the query
Data scanned (MB) | The amount of data the ad-hoc query has read from the data lake
Data billed (MB) | The amount of data being billed (depending also on your contract); amounts of data less than 10 MB in an ad-hoc query will be billed as if they were 10 MB
Details | The query string as well as a link to the associated Dremio job in an expandable box

System log

In the navigator, select Auditing and then System log to view the system log.

At the top of the page, you can select log entries with status all/successful/erroneous/running, define a text filter on the log entries, and select a time period. Use the pagination buttons at the bottom of the page to navigate through the result list.

For each log entry, the following information is provided:

Column name | Description
User | The user that has carried out the operation
Event | The type of operation
Details | The details of the operation and, if available, further information in an expandable box

Endpoints for monitoring

ETL pipeline health

The Cumulocity DataHub microservice exposes an endpoint to automatically monitor the health of active offloading jobs as well as compaction and data collection jobs. The health status can be monitored with the endpoint GET /service/datahub/scheduler/health. The endpoint accepts two optional parameters, format and check.

The parameter format determines the format of the response body. It supports the following values:

Value | Definition
text | Send the response body as plain text.
json | Send the response body as JSON.

If format is not set, the text option is used by default.

The parameter check defines which jobs are reported. The parameter supports the following values:

Value | Definition
ALL | All jobs are reported.
OFFLOADING | Only offloading jobs are reported. In corresponding messages, such a job is also denoted as a CTAS job.
COMPACTION | Only compaction jobs are reported.
DremioJobDetailPersistence_OFFLOADING | Only the job for collecting and persisting offloading usage data is reported.
DremioJobDetailPersistence_QUERY | Only the job for collecting and persisting usage data for ad-hoc queries is reported.
C8Y_BILLING_METRICS | Only the job for submitting usage data is reported.

If check is not set, all jobs except C8Y_BILLING_METRICS are reported.
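
For illustration, the endpoint could be queried with both parameters as in the following sketch, assuming Python with the requests library; the tenant URL and credentials are placeholders:

    import requests

    # Placeholder tenant URL and credentials; replace with your own.
    BASE_URL = "https://mytenant.cumulocity.com"
    AUTH = ("tenantid/username", "password")

    # Request the health of offloading jobs only, with a JSON response body.
    response = requests.get(
        BASE_URL + "/service/datahub/scheduler/health",
        params={"format": "json", "check": "OFFLOADING"},
        auth=AUTH,
    )
    print(response.status_code)
    print(response.text)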

The endpoint examines the latest job executions of qualified jobs and classifies them:

  • If the job has failed, it is reported as CRITICAL.
  • If the job is still running, it is categorized as follows:
    • If it is running for up to one hour, its health is classified as STEADY.
    • If it is running for up to six hours, its health is classified as WARNING.
    • If it is running for more than six hours, its health is classified as CRITICAL.
  • If the job has succeeded, it is checked whether it was the last job that should have been run for this configuration. If there should have been a new run of this job and the system is already 10 minutes behind the scheduled execution time, the job is classified as CRITICAL. Otherwise, the job is classified as STEADY.
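
The following sketch re-implements these classification rules for illustration only; the status labels and job fields are assumptions and do not reflect the actual microservice code:

    from datetime import datetime, timedelta, timezone

    def classify_job(status, started_at, overdue_since=None, now=None):
        """Illustrative re-implementation of the health classification rules.

        status: "FAILED", "RUNNING", or "SUCCEEDED" (assumed labels)
        started_at: start time of the latest execution (timezone-aware)
        overdue_since: scheduled time of a run that should already have started, or None
        """
        now = now or datetime.now(timezone.utc)
        if status == "FAILED":
            return "CRITICAL"
        if status == "RUNNING":
            runtime = now - started_at
            if runtime <= timedelta(hours=1):
                return "STEADY"
            if runtime <= timedelta(hours=6):
                return "WARNING"
            return "CRITICAL"
        # Succeeded: CRITICAL if a newer run is overdue by more than 10 minutes.
        if overdue_since and now - overdue_since > timedelta(minutes=10):
            return "CRITICAL"
        return "STEADY"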

If all jobs are classified as STEADY, the endpoint returns the HTTP status code 200 with the following message:

“HTTP 200 CDHCBEI0029 - Scheduler healthcheck succeeded.”

Otherwise, the endpoint returns the HTTP status code 500 with the following message:

“HTTP 500 CDHCBEE0031 - Scheduler healthcheck failed: There were failed or suspended jobExecutions.”

The response body indicates the jobs to be checked by an administrator:

“There were failed or suspended jobExecutions:
CRITICAL: Job should already have been executed at 14:08:03.705: uuid=34391b71-abaa-477e-b870-2c32aa6ea790, jobType=CTAS, jobRunId=CDHScheduler_9cd4309c-99d7-43ae-92f7-4f1d267faff71713875003234”

The endpoint can be accessed by any logged-in Cumulocity user who is authorized to access the Cumulocity DataHub microservice.
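
In an external monitoring script it is usually sufficient to evaluate the HTTP status code and forward the response body to an administrator when the check fails. A minimal sketch, again with placeholder URL and credentials:

    import requests

    def scheduler_is_healthy(base_url, auth):
        """Return True on HTTP 200 (all jobs STEADY), False on HTTP 500."""
        response = requests.get(base_url + "/service/datahub/scheduler/health", auth=auth)
        if response.status_code == 200:
            return True
        # On failure, the body lists the jobs an administrator should check.
        print(response.text)
        return False

    if not scheduler_is_healthy("https://mytenant.cumulocity.com", ("tenantid/username", "password")):
        raise SystemExit(1)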

Managing the data lake

Cumulocity DataHub uses a data lake to store data being offloaded from the Cumulocity operational database. The data is organized in hierarchical folders, following a temporal hierarchy. Within these folders, the offloaded data is stored in Parquet files. During the offloading process, Cumulocity DataHub creates temporary Parquet files holding intermediate data; these files are deleted afterwards. To prevent the data from being spread over many small files, a compaction process runs regularly and produces fewer, larger files.

The contents and hierarchy of the data lake must not be modified. Otherwise, there is a high risk that data is lost and that subsequent queries against the data lake produce incomplete results.

Folder structure

The data within the data lake is organized hierarchically. Each offloading pipeline is associated with one target table. Each target table corresponds to a folder in the data lake with the same name. Such a folder consists of three different types of subfolders:

  • Monthly/daily folder: The folder name starts with monthly or daily followed by the timespan of data managed within that folder. For example, monthly_2024_01 contains all data from January 2024, while daily_2024_01_15 contains all data from the 15th of January 2024.
  • Initial offloading folder: When an offloading pipeline for the measurements collection offloads data for the first time, all data from this initial offloading is located in folders whose names start with chunk. Within a chunk folder, the data is also organized hierarchically with respect to years, months, and days, as encoded in the folder names. The chunk folders are optional.
  • Internal folders: Folders starting with incremental contain internal information and must not be deleted.
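
For read-only inspection of a target table folder, the folder type can be derived from the name prefix. The following sketch is purely illustrative; the example folder names are hypothetical but follow the conventions described above:

    def classify_folder(name):
        """Map a folder name inside a target table folder to the types described above."""
        if name.startswith(("monthly_", "daily_")):
            return "time-partitioned data folder"
        if name.startswith("chunk"):
            return "initial offloading folder"
        if name.startswith("incremental"):
            return "internal folder (must not be deleted)"
        return "unknown"

    for folder in ("monthly_2024_01", "daily_2024_01_15", "chunk_2023_12", "incremental_1694787385"):
        print(folder, "->", classify_folder(folder))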

Empty Parquet files

Cumulocity DataHub may produce empty Parquet files in certain situations, for example when an execution node crashes during a write process. If such empty files exist in the data lake, the initial configuration as well as offloading runs will fail. Resolving this requires manual interaction with the data lake, as Cumulocity DataHub does not delete those empty files automatically. You must delete them yourself using the tooling of your data lake provider, such as the AWS S3 Console or Azure Storage Explorer.
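
If your data lake is hosted on AWS S3, zero-byte Parquet files can, for example, be located with a short script before you decide which folders to clean up. A minimal sketch using boto3; the bucket name and prefix are placeholders:

    import boto3

    BUCKET = "my-datahub-data-lake"   # placeholder bucket name
    PREFIX = "events/"                # placeholder folder prefix

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    # List all zero-byte Parquet files under the prefix for manual review.
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".parquet") and obj["Size"] == 0:
                print("Empty Parquet file:", obj["Key"])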

If the initial configuration has failed due to an empty Parquet file, the error message shown during the failed configuration attempt provides details on the file, including the folder containing the empty Parquet file, like c8y_cdh_temp/connectionTest. You must delete that folder with all its sub-folders, including any other, non-empty Parquet files, to avoid inconsistencies caused by incomplete, partially written data.

If an offloading run has failed, the associated error is shown in the job history, providing details on the empty Parquet file causing the error. Browse to the associated collection folder in the data lake, like events or alarms. Within that folder, several sub-folders may exist, starting with incremental_, daily_, monthly_, or chunk_. The error message names the folder in which the empty Parquet file is located, like events/incremental_1694787385. You must delete that folder with all its sub-folders. With the next offloading run, the corresponding time frame of data within that folder is offloaded again, so no data is lost. However, data might be lost if it had already moved out of the retention window of the operational database before a corresponding offloading run was successfully executed.
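
On AWS S3, deleting a folder with all its sub-folders corresponds to deleting all objects under the folder prefix. A minimal sketch using boto3; the bucket name is a placeholder, and the prefix is taken from the example error message above:

    import boto3

    def delete_folder(bucket_name, folder_prefix):
        """Delete all objects under the given prefix, i.e. the folder and its sub-folders."""
        s3 = boto3.resource("s3")
        bucket = s3.Bucket(bucket_name)
        # Double-check the prefix before running this; the deletion cannot be undone.
        bucket.objects.filter(Prefix=folder_prefix).delete()

    delete_folder("my-datahub-data-lake", "events/incremental_1694787385/")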