Operating Cumulocity DataHub
This section describes how you can access system information, usage statistics, and audit logs.
In the navigator, select Administration and then System status to get information about the system configuration and its status.
Under Microservice you will find the status of the microservice, which is either marked as green or red. This status reflects whether the microservice can be accessed from the web application. If the microservice is accessible, its current version is shown. If not, check the status of the microservice and its logs as described in Managing applications.
Under Web application you will find the version of the web application.
Under Dremio you will find the status of Dremio, which is either marked as green or red. This status reflects whether Dremio can be accessed from the microservice. If Dremio is accessible, its current version is shown. If not, check the status of the microservice and its logs as described in Managing applications.
Under Management you will find the setup of the system. If you expand that box by clicking on the arrow to the right, all relevant system properties and their values are listed. Note that these values cannot be modified for a running microservice. The tenant administrator must redeploy the microservice with corresponding new values.
If enabled, Cumulocity DataHub tracks usage statistics on the amount of data being processed. For offloading queries, the statistics track the amount of data the queries read from the Operational Store of Cumulocity. For ad-hoc queries, they track the amount of data the queries read from the data lake. The usage statistics can be used for volume-based charging. They can also be used to pinpoint queries that are resource-intensive in terms of network load.
In the navigator, select Administration and then Usage statistics to view the usage statistics.
In the action bar, a date control allows you to select the month for which you want to see the usage statistics.
The three top panels show overall summary statistics as well as statistics separated for offloading and ad-hoc queries. If data from the month before the selected month is available, a tendency arrow illustrates whether the data volume of the selected month has decreased, increased, or stayed flat. The panels with the offloading and the ad-hoc query statistics additionally list the days with minimum/maximum volume as well as the daily average volume.
The table below the summary statistics shows the details on a per-day basis for the selected month. For each day, the volume offloaded and the volume queried are shown as well as their sum, which constitutes the daily volume. In addition, the percentage of the monthly volume is shown, that is, how much the daily volume contributed to the overall monthly volume. The date of each entry links to the Query log, which lists all queries for the respective day.
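The per-day figures described above follow from simple arithmetic. The following is a minimal sketch of that calculation, not the DataHub implementation; the class `DayVolume`, the function `monthly_shares`, and the sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DayVolume:
    """One row of the per-day table (hypothetical structure, volumes in MB)."""
    date: str
    offloaded_mb: float
    queried_mb: float

    @property
    def daily_mb(self) -> float:
        # The daily volume is the sum of the offloaded and the queried volume.
        return self.offloaded_mb + self.queried_mb


def monthly_shares(days: list[DayVolume]) -> dict[str, float]:
    """Return each day's contribution to the monthly volume, in percent."""
    total = sum(day.daily_mb for day in days)
    if total == 0:
        # Avoid dividing by zero for a month without any recorded volume.
        return {day.date: 0.0 for day in days}
    return {day.date: 100.0 * day.daily_mb / total for day in days}


if __name__ == "__main__":
    sample = [
        DayVolume("2024-05-01", offloaded_mb=30.0, queried_mb=10.0),
        DayVolume("2024-05-02", offloaded_mb=50.0, queried_mb=10.0),
    ]
    for date, share in monthly_shares(sample).items():
        print(f"{date}: {share:.1f} % of the monthly volume")
```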
Auditing provides two logs: the query log lists the queries that have been executed, and the system log lists the operations that users have carried out.
In the navigator, select Auditing and then Query log to view the query log.
At the top of the page you can select either offload or ad-hoc queries, define a text filter on the offloading task/ad-hoc query string, and select a time period. Use the pagination buttons at the bottom of the page to navigate through the result list.
For each offloading query, the following information is provided:
Column name | Description |
---|---|
Offloading task | The task name of the offloading pipeline, complemented by a status icon showing success or failure of the pipeline execution |
Runtime | The runtime of the execution |
Data scanned (MB) | The amount of data the offloading query has read from the Operational Store of Cumulocity |
Data billed (MB) | The amount of data being billed (depending also on your contract); amounts of data less than 10 MB in an offloading query will be billed as if they were 10 MB |
Details | The internal task UUID in an expandable box |
For each ad-hoc query, the following information is provided:
Column name | Description |
---|---|
User | The username of the Dremio user who executed the query |
Query | The SQL query, complemented by a status icon showing success or failure of the query execution |
Runtime | The runtime of the execution |
Data scanned (MB) | The amount of data the ad-hoc query has read from the data lake |
Data billed (MB) | The amount of data being billed (depending also on your contract); amounts of data less than 10 MB in an ad-hoc query will be billed as if they were 10 MB |
Details | The query string as well as a link to the associated Dremio job in an expandable box |
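The relationship between the Data scanned and Data billed columns in both tables can be sketched as follows. This is only an illustration of the 10 MB minimum stated above; actual billing also depends on your contract, and the names `MIN_BILLED_MB` and `billed_mb` are hypothetical.

```python
# Minimum billed amount per query, as stated in the column descriptions.
MIN_BILLED_MB = 10.0


def billed_mb(scanned_mb: float) -> float:
    """Return the billed amount for one offloading or ad-hoc query.

    Amounts below the minimum are billed as if they were 10 MB;
    larger amounts are billed as scanned. Contract-specific rules
    are not modeled here.
    """
    return max(scanned_mb, MIN_BILLED_MB)


if __name__ == "__main__":
    for scanned in (3.2, 10.0, 250.5):
        print(f"scanned {scanned} MB, billed {billed_mb(scanned)} MB")
```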
In the navigator, select Auditing and then System log to view the system log.
At the top of the page you can select log entries with the status all, successful, erroneous, or running, define a text filter on the log entries, and select a time period. Use the pagination buttons at the bottom of the page to navigate through the result list.
For each log entry, the following information is provided:
Column name | Description |
---|---|
User | The user that has carried out the operation |
Event | The type of operation |
Details | The details of the operation and, if available, further information in an expandable box |
The Cumulocity DataHub microservice exposes an endpoint for automatically monitoring the health of active offloading configurations. To monitor the ETL pipeline health, send a request to the endpoint GET /service/datahub/scheduler/health.
The endpoint examines the latest job executions of all jobs and classifies them:
If all jobs are classified as STEADY, the endpoint returns the HTTP status code 200 with the following message:
“HTTP 200 CDHCBEI0029 - Scheduler healthcheck succeeded.”
Otherwise, the endpoint returns the HTTP status code 500 with the following message:
“HTTP 500 CDHCBEE0031 - Scheduler healthcheck failed: There were failed or suspended jobExecutions.”
The response body indicates the jobs to be checked by an administrator:
{
"error" : "There were failed or suspended jobExecutions: \n\nCRITICAL: Job failed: uuid=0d2eb545-cae5-4718-b6c1-50c4169bac69, jobType=CTAS, jobRunId=NON_CLUSTERED1580741460697\n\n"
}
The endpoint can be accessed by any logged in Cumulocity user who is authorized to access the Cumulocity DataHub microservice.
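A monitoring script could poll the health endpoint and extract the jobs an administrator needs to check. The following is a minimal sketch using only the Python standard library. It assumes that the error text always follows the `uuid=..., jobType=..., jobRunId=...` pattern of the sample response above and that the caller supplies a ready-made `Authorization` header value; the function names and both parameters are hypothetical.

```python
import json
import re
import urllib.error
import urllib.request

HEALTH_PATH = "/service/datahub/scheduler/health"

# Assumed pattern, derived from the sample response body shown above.
JOB_PATTERN = re.compile(
    r"uuid=(?P<uuid>[^,\s]+), jobType=(?P<jobType>[^,\s]+), jobRunId=(?P<jobRunId>[^,\s]+)"
)


def parse_failed_jobs(body: str) -> list[dict]:
    """Extract the job details from the "error" field of a failed health check."""
    message = json.loads(body).get("error", "")
    return [match.groupdict() for match in JOB_PATTERN.finditer(message)]


def check_health(base_url: str, auth_header: str) -> tuple[bool, list[dict]]:
    """Call the health endpoint; return (healthy, jobs to be checked)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + HEALTH_PATH,
        headers={"Authorization": auth_header},
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status == 200, []
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for non-2xx codes such as the HTTP 500 case.
        body = err.read().decode("utf-8", errors="replace")
        try:
            return False, parse_failed_jobs(body)
        except (json.JSONDecodeError, AttributeError):
            # The body was not the expected JSON object; report unhealthy only.
            return False, []
```

Calling `check_health` with your tenant URL and an authorization header returns `(True, [])` when the endpoint responds with HTTP 200, and `False` plus the parsed job details otherwise.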
Cumulocity DataHub uses a data lake to store data being offloaded from the Cumulocity operational database. The data is organized in hierarchical folders, following a temporal hierarchy. Within the folders the offloaded data is organized in Parquet files. During the offloading process Cumulocity DataHub creates temporary Parquet files, holding intermediate data, which are deleted afterwards. To prevent data from being spread across many small files, a compaction process is executed regularly, producing fewer, larger files.
The contents and hierarchy of the data lake must not be modified. Otherwise, there is a high risk that data is lost and that subsequent queries of the data lake produce incomplete results.
The data within the data lake is organized hierarchically. Each offloading pipeline is associated with one target table. Each target table corresponds to a folder in the data lake with the same name. Such a folder consists of three different types of subfolders:
Cumulocity DataHub may produce empty Parquet files in certain situations, for example when an execution node crashes during a write process. If such empty files exist in the data lake, the initial configuration as well as offloading runs will fail. Cumulocity DataHub does not delete those empty files automatically. You must delete them manually using the tooling of your data lake provider, like AWS S3 Console or Azure Storage Explorer.
If the initial configuration has failed due to an empty Parquet file, the error message shown during the failed configuration attempt provides the details on the file. This includes the folder containing the empty Parquet file, like c8y_cdh_temp/connectionTest. You must delete the folder with all its sub-folders, including potential other non-empty Parquet files, to avoid inconsistencies caused by incomplete, partially written data.
If an offloading run has failed, the associated error is shown in the job history, providing details on the empty Parquet file causing the error. You must browse to the associated collection folder in the data lake, like events or alarms. Within that folder several sub-folders can exist, starting with incremental_, daily_, monthly_, or chunk_. The error message gives you the corresponding folder in which that empty Parquet file is located, like events/incremental_1694787385. You must delete the folder with all its sub-folders. With the next offloading run, the corresponding time frame of data within that folder will be offloaded again, so that no data is lost. However, data might be lost if it has already moved out of the retention window in the operational database before a corresponding offloading run was successfully executed.
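Before deleting anything, it can help to verify that the folder taken from the error message has the expected shape, so that you do not accidentally remove an entire collection folder. The following is a minimal sketch of such a check; it only inspects the path string and does not access the data lake. The function name `split_offloading_folder` is hypothetical, and the prefix list is taken from the description above.

```python
# Sub-folder prefixes named in the description above.
KNOWN_PREFIXES = ("incremental_", "daily_", "monthly_", "chunk_")


def split_offloading_folder(path: str) -> tuple[str, str, str]:
    """Split a folder path from an error message into its parts.

    Returns (collection folder, sub-folder type, sub-folder name) and
    raises ValueError if the path does not point to a known sub-folder.
    """
    collection, _, subfolder = path.strip("/").partition("/")
    for prefix in KNOWN_PREFIXES:
        if collection and subfolder.startswith(prefix):
            return collection, prefix.rstrip("_"), subfolder
    raise ValueError(f"Not a recognized offloading sub-folder: {path!r}")


if __name__ == "__main__":
    print(split_offloading_folder("events/incremental_1694787385"))
```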