This section describes how to run Cumulocity DataHub on the Cumulocity Edge, the local version of Cumulocity.
DataHub Edge is an optional component of Edge. DataHub Edge complements the ad-hoc querying of recent device data with analytical querying over long periods of time. For that purpose, data is moved from the Operational Store of Edge to a local data lake, with the data being stored in a concise and query-efficient format. DataHub Edge allows you to run SQL queries against the data lake contents so that you can gain more insights into your device data.
DataHub Edge is the counterpart of DataHub, the variant for cloud deployments, and offers the same set of functionality. To learn more about DataHub in general, see DataHub overview.
Cumulocity DataHub Edge overview
Cumulocity Edge is an onsite, single-server, and single-tenant variant of the Cumulocity Core platform. It is delivered as a software appliance designed to run on industrial PCs or local servers. Cumulocity DataHub is available as an add-on to Cumulocity Edge.
Cumulocity DataHub Edge offers the same functionality as the cloud-variant of Cumulocity DataHub, but stores the data locally. You can define offloading pipelines, which regularly move data from the Operational Store of Cumulocity into a data lake. In the Edge setup, a NAS is used as data lake. Dremio, the internal engine of Cumulocity DataHub, can access the data lake and run analytical queries against its contents, using SQL as the query interface.
Cumulocity DataHub Edge consists of the following building blocks:
The Cumulocity DataHub UI is deployed as a web application in the Cumulocity core. It provides the frontend for defining and managing offloading pipelines.
The Cumulocity DataHub backend manages offloading pipelines and their scheduled execution. The backend and its associated database run within one Docker container managed by the Docker daemon. Its internal state, for example the defined offloading configurations, is persisted on the central data disk.
The query processing is based on a Dremio master and a Dremio executor as well as a ZooKeeper instance. The Dremio master and ZooKeeper run in one Docker container, and the Dremio executor runs in another; both containers are managed by the Docker daemon. The internal state of the containers, for example the query job history, is persisted on the central data disk. For simplicity, only Dremio is shown in the figure above.
The data lake is located on the central data disk.
Cumulocity DataHub Edge versus Cumulocity DataHub cloud deployments
Cumulocity DataHub Edge uses the same software as Cumulocity DataHub, but the two variants differ in the following aspects:
| Area | Cumulocity DataHub Edge | Cumulocity DataHub Cloud |
| --- | --- | --- |
| High availability | Depends on the underlying virtualization technology | Depends on the cloud deployment setup |
| Vertical scalability | Yes | Yes |
| Horizontal scalability | No | Yes |
| Upgrades with no downtime | No | No |
| Root access | No | Yes, if customer is hosting |
| Installation | Offline | Online |
| Dremio cluster setup | 1 master, 1 executor | Minimum 1 master, 1 executor |
| Dremio container management | Docker daemon | Kubernetes |
| Cumulocity DataHub backend container management | Docker daemon | Microservice in Cumulocity Core |
| Data lakes | NAS | Azure Storage, S3, HDFS, (NAS) |
Setting up Cumulocity DataHub Edge
Prerequisites
Before setting up Cumulocity DataHub Edge, you must check the following prerequisites:
| Item | Details |
| --- | --- |
| Cumulocity Edge | The local version of Cumulocity is set up on a Virtual Machine (VM). See also Installing Edge. |
| Cumulocity DataHub Edge archive | You have downloaded the archive with all installation artifacts as described under Installation requirements. |
| Internet access | Internet access is not required. |
Hardware requirements
The hardware requirements for running a bare Cumulocity Edge instance are described in Requirements. When Cumulocity DataHub Edge is additionally running, the hardware requirements of the virtual machine are as follows:

- 100 GB of free disk space, plus sufficient free disk space for the data lake contents
- Intel x86 CPU
- RAM: 12 GB recommended, 8 GB minimum
- Logical CPU cores: 4 recommended, 2 minimum
- One NIC

Hardware requirements for the host OS are excluded.
Setting up Cumulocity DataHub Edge
1. Copy the Cumulocity DataHub Edge archive to the Cumulocity Edge.
2. Log in as admin.
3. Run the install script from the archive.

During script execution, you are prompted for the username and password of the administration user of the edge tenant. You are also prompted to set a new password for the Dremio admin account. The installation takes a few minutes to complete; afterwards you can delete the Cumulocity DataHub Edge archive.
The install script runs the following basic steps:
1. Deploy the Cumulocity DataHub Edge UI as a web application to Cumulocity Core.
2. Start a Docker container with the Cumulocity DataHub Edge backend and the database system for managing the backend state.
3. Start a Docker container with the Dremio master and a ZooKeeper instance.
4. Start a Docker container with the Dremio executor.
5. Configure corresponding roles and permissions in Cumulocity Core.
The Docker containers will be restarted automatically if the container itself fails or the applications within are no longer reachable.
The containers are configured to store their application state on the data disk under /opt/mongodb:
- /cdh-master/data: the state of the Dremio master
- /cdh-executor/data: the state of the Dremio executor
- /cdh-console/db: the state of the Cumulocity DataHub Edge backend
- /cdh-master/datalake: the data lake folder
Caution
You must not modify the contents of these folders as this may corrupt your installation.
Upgrading Cumulocity DataHub Edge
An upgrade of Cumulocity DataHub Edge follows the same steps as the initial setup: copy the archive with the new version to Cumulocity Edge, log in as admin, and run the install script of the new version.
During script execution, the already installed version is detected and the script performs an upgrade to the new version. The upgrade takes a few minutes to complete; afterwards you can delete the Cumulocity DataHub Edge archive.
Adapting to network changes of Cumulocity Edge
There might be cases where you must change the network setup of your Edge installation, for example, changing the IP range used internally by Edge or changing the domain name. After such a change, you must adapt the network configuration of Cumulocity DataHub Edge by running the script /opt/softwareag/cdh/bin/restart.sh once. The script restarts Cumulocity DataHub with parameters aligned with the new network configuration.
Accessing Cumulocity DataHub Edge
The different Cumulocity DataHub Edge interfaces can be accessed in the same way as in a cloud deployment of Cumulocity DataHub.
| Interface | Description |
| --- | --- |
| Cumulocity DataHub Edge UI | The UI can be accessed via the application switcher after you have logged into the Cumulocity Edge UI. Alternatively, you can access it directly under http://edge_domain_name/apps/datahub-ui or https://edge_domain_name/apps/datahub-ui, depending on whether TLS/SSL is used. A login is required, with "edge" as the tenant name. |
| Dremio UI | On the Cumulocity DataHub Edge home page you will find a link to the Dremio UI. Alternatively, you can access it directly under http://datahub.edge_domain_name or https://datahub.edge_domain_name, depending on whether TLS/SSL is used. You can log in as admin using the password defined during installation. |
| Cumulocity DataHub JDBC/ODBC | You find the connection settings and the required driver version for JDBC/ODBC on the Home page of the Cumulocity DataHub Edge UI. |
| Cumulocity DataHub REST API | The microservice hosting the API is available at https://edge_domain_name/service/datahub. |
| Dremio REST API | The Dremio URL to run REST API requests against is either http://datahub.edge_domain_name or https://datahub.edge_domain_name, depending on whether TLS/SSL is used. |
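As a sketch of working with the Dremio REST API, the snippet below extracts the session token from a login response. The /apiv2/login path follows Dremio's public REST API; the URL, credentials, and sample response are placeholders for illustration and should be verified against the Dremio version shipped with your installation.

```shell
# Sketch: authenticate against the Dremio REST API and extract the session token.
# The commented curl call shows the assumed request shape; the sample response
# below stands in for a real login response.
#
# login_response=$(curl -s -X POST "https://datahub.edge_domain_name/apiv2/login" \
#   -H "Content-Type: application/json" \
#   -d '{"userName":"admin","password":"<dremio-admin-password>"}')
login_response='{"token":"abc123","userName":"admin"}'  # sample response
token=$(printf '%s' "$login_response" | sed -n 's/.*"token":"\([^"]*\)".*/\1/p')
echo "$token"
```

Subsequent requests pass this token in the Authorization header; check the Dremio REST API documentation for the exact header format expected by your Dremio version.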
Requirements
For JDBC/ODBC you must configure Cumulocity Edge so that port 32010 can be accessed from the host system. For instructions on port forwarding see Installing Edge.
The setup of the Dremio account and the data lake is done in the same way as in a cloud deployment. See Setting up Cumulocity DataHub for details.
Cumulocity DataHub Edge is configured to use a NAS as data lake. When configuring the NAS, use /datalake as the mount path. This path is mounted to /opt/mongodb/cdh-master/datalake.
Changing Dremio memory configuration on Cumulocity DataHub Edge
Depending on the use case, it might be necessary to increase the memory available to Dremio, the internal engine of Cumulocity DataHub. By default, Dremio is configured to consume a maximum of 4 GB of RAM (roughly 2 GB each for the master node and the executor node).
Depending on the situation, you must increase the memory of either Dremio's master node or its executor node. In many cases the master node's memory is the limiting factor, but not always. Inspecting the query profiles in Dremio helps to determine where the bottleneck occurs.
Master node memory configuration
Run the following steps:
1. Log into Edge via SSH.
2. As root, edit /etc/cdh/cdh-master/dremio-env, for example with vi, and adjust DREMIO_MAX_HEAP_MEMORY_SIZE_MB=1750 and DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=250 to your needs. For example, you can double both values.
3. Run service cdh-master restart.
Executor node memory configuration
Run the following steps:
1. Log into Edge via SSH.
2. As root, edit /etc/cdh/cdh-executor/dremio-env, for example with vi, and adjust DREMIO_MAX_HEAP_MEMORY_SIZE_MB=1024 and DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=1488 to your needs. For example, you can double both values.
3. Run service cdh-executor restart.
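The edit in step 2 can also be scripted. The sketch below doubles both settings with sed, shown here against a temporary local copy of the file; on Edge the real file is /etc/cdh/cdh-executor/dremio-env (or /etc/cdh/cdh-master/dremio-env for the master node) and must be edited as root.

```shell
# Sketch: double the Dremio memory settings in a dremio-env file.
# Shown against a temporary sample copy with the default executor values;
# the KEY=value line format matches the settings named in the steps above.
cp_file=$(mktemp)
cat > "$cp_file" <<'EOF'
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=1024
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=1488
EOF

# Rewrite each KEY=value line with the doubled value.
for key in DREMIO_MAX_HEAP_MEMORY_SIZE_MB DREMIO_MAX_DIRECT_MEMORY_SIZE_MB; do
  old=$(sed -n "s/^$key=//p" "$cp_file")
  sed -i "s/^$key=.*/$key=$((old * 2))/" "$cp_file"
done
cat "$cp_file"
```

After changing the real file, apply the new values with service cdh-executor restart (or service cdh-master restart for the master node).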
Working with Cumulocity DataHub Edge
Cumulocity DataHub Edge offers the same set of functionality as the cloud variant. See Working with Cumulocity DataHub for details on configuring and monitoring offloading jobs, querying offloaded Cumulocity data, and refining offloaded Cumulocity data.
Operating Cumulocity DataHub Edge
Similar to the cloud variant, the Cumulocity DataHub Edge UI allows you to check system information and view audit logs. See Operating Cumulocity DataHub for details.
When managing Cumulocity DataHub Edge, the following standard tasks are additionally relevant.
If you need to contact product support, include the output of the diagnostics script. See Diagnostic utility for details of how to run it.
Health check
Check Cumulocity DataHub Edge backend status
You can check the status of the backend on the Administration page of the Cumulocity DataHub UI. Alternatively, you can query the isalive endpoint.
Dremio is running if OK is returned. No response is returned if Dremio is not running or is inaccessible.
Log files
The installation log file is stored at /var/log/cdh.
In order to access the logs of the Cumulocity DataHub and Dremio containers, use the docker logs command. To follow the logs of cdh-master, run:
docker logs -f cdh-master
To follow the logs of cdh-executor, run:
docker logs -f cdh-executor
The containers are configured to rotate log files with rotation settings of two days and a maximum file size of 10 MB.
Monitoring
Cumulocity Edge uses Monit for management and monitoring of relevant processes. See Monitoring for details. The Cumulocity DataHub Edge processes, namely the Cumulocity DataHub backend and the Dremio nodes, are also monitored by Monit.
Data disk management and monitoring
The data disk is used for storing the state of Cumulocity DataHub and Dremio and also serves as the data lake. To ensure that the system works properly, the disk must not run out of space. The main drivers of disk space consumption in Cumulocity DataHub Edge are the Dremio job profiles and the data lake contents.
Cleanup of Dremio job history
Dremio maintains a history of job details and profiles, which can be inspected on the Jobs page of the Dremio UI. This job history must be cleaned up regularly to free the resources used to store it.
Dremio is configured to perform the cleanup of job results automatically without downtime. The default value for the maximum age of stored job results is seven days. To change that value, a Dremio administrator must modify the support key jobs.max.age_in_days. The changes become effective within 24 hours or after restarting Dremio. See the corresponding Dremio documentation for more details on support keys.
Cleanup of data lake contents
The data lake contents are not automatically purged, as the main purpose of Cumulocity DataHub is to maintain a history of data. However, if disk space is critical and cannot be freed otherwise, you may have to delete parts of the data lake contents. Instead of deleting the data, you can also move it elsewhere.
Browse to the data lake folder /opt/mongodb/cdh-master/datalake and select the folder whose name equals the target table of the offloading pipeline. The data within the data lake is organized hierarchically, as described in section Folder structure. To free up disk space, delete the chunk folders and all monthly/daily folders up to a point in time that fits your needs. For example, delete all folders whose names indicate that the data is older than January 1, 2024. In general, you must delete complete folders, not single files within a folder. After deleting the folders, you must make Dremio aware of the changed data lake contents. Given the path to your target table, run the following query in Dremio as an administrator:
ALTER PDS <target_table_path> REFRESH METADATA FORCE UPDATE
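The deletion step can be sketched as follows, shown against a temporary sample of a target table's time-based folder hierarchy. The year-folder layout used here is an assumption for illustration; the real data lake lives under /opt/mongodb/cdh-master/datalake, and the actual layout is described in section Folder structure.

```shell
# Sketch: free disk space by deleting complete time-based folders from the
# data lake. Runs against a temporary sample hierarchy; the year-folder
# naming is an assumption for illustration.
lake=$(mktemp -d)
mkdir -p "$lake/target_table/2023/01" "$lake/target_table/2023/12" "$lake/target_table/2024/01"

cutoff=2024   # delete all yearly folders older than this year
for dir in "$lake"/target_table/*/; do
  year=$(basename "$dir")
  if [ "$year" -lt "$cutoff" ]; then
    rm -rf "$dir"   # always delete complete folders, never single files
  fi
done
ls "$lake/target_table"
```

After such a deletion on the real data lake, run the ALTER PDS ... REFRESH METADATA statement above so that Dremio picks up the changed contents.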
Caution
Data deleted from the data lake cannot be recovered.
Backup and Restore
Cumulocity DataHub’s runtime state as well as the data lake containing offloaded data reside in the Cumulocity Edge server VM. To back up and restore Cumulocity DataHub, its runtime state, and its data, we recommend backing up and restoring the Cumulocity Edge server VM as described in Backup and restore.