Setting up Cumulocity DataHub

This section describes how to set up Cumulocity DataHub.

Prerequisites

Before setting up Cumulocity DataHub, the following prerequisites must be checked:

The Cumulocity DataHub microservice and web application must be available as applications on your tenant. The web application provides the user interface to configure Cumulocity DataHub and to manage offloading pipelines, while the microservice provides the corresponding backend functionality. The web application is named DataHub, whereas the microservice is named Datahub. Both applications are deployed either as:

  • Subscribed application: the applications were subscribed to the tenant by the management or super tenant
  • Custom application: the applications were added to the tenant

If you have an Enterprise tenant, you can also subscribe your subtenants to both applications so that the subtenants can use Cumulocity DataHub as well.

See Managing applications for details on managing Cumulocity applications in general, including instructions for adding applications to a tenant.

See Managing microservices and Monitoring microservices for details on Cumulocity microservices, including instructions for:

  • Adding microservices to a tenant
  • Checking the status, permissions, and log files of a microservice

See Managing tenants for details on subscribing applications or microservices to a tenant or subtenant.

For the offloading of Cumulocity data, you need the connection settings and credentials for a cloud data lake service. During offloading, the data is written into a data lake folder named after the tenant.

Info
This section provides instructions on how to configure the data lake so that it is accessible via Dremio. More details can be found in the Dremio data source documentation. Note that you must not create the target table, which connects to the data lake, in Dremio; this is done by Cumulocity DataHub.

Defining Cumulocity DataHub permissions and roles

Dedicated permissions define what a user is allowed to do in Cumulocity DataHub. To ease assigning permissions to users, permissions are grouped in roles. During deployment of the Cumulocity DataHub applications, the corresponding permissions and roles are created. If a role with the same name already exists, no new role will be created. The same holds for permissions.

If you do not have corresponding Cumulocity DataHub permissions, you will get a warning after login.

Important
When offloading the inventory/events/alarms/measurements collections, Cumulocity DataHub does not incorporate access limitations for these collections as set in the Cumulocity platform. In particular, inventory roles defining permissions to device groups are not incorporated in the offloading process. As a consequence, a user with Cumulocity DataHub permissions can access all data in the data lake, irrespective of the access restrictions the user has on the base collections.

Cumulocity DataHub roles and permissions

Cumulocity DataHub administrator

The administrator primarily sets up the data lake and Dremio account and conducts administrative tasks like inspecting audit logs or monitoring the system status. The administrator can also manage offloading pipelines, for example, defining and starting a pipeline.

For those tasks the default role DataHub Administrator is created. The permissions for this role are defined as follows:

Type                     READ   ADMIN
DataHub administration   yes    yes
DataHub management       yes    yes
DataHub query            yes    no

While READ refers to reading the specific data, ADMIN refers to creating, updating, or deleting the specified data.

Cumulocity DataHub manager

The manager manages offloading pipelines such as defining and starting a pipeline. For those tasks the default role DataHub Manager is created. The permissions for this role are defined as follows:

Type                     READ   ADMIN
DataHub administration   no     no
DataHub management       yes    yes
DataHub query            yes    no

Cumulocity DataHub user

The user executes SQL queries against the data in the data lake. For details on querying the data lake see Querying offloaded Cumulocity data. To execute queries the following approaches can be used:

  • Dremio UI: The Dremio account defined in Setting up Dremio users is used for logging into the Dremio UI and executing queries within that UI.
  • Dremio API: Queries can also be executed using the Dremio REST API. The Dremio account defined in Setting up Dremio users is used for authenticating the requests against that API. Directly invoking Dremio APIs is discouraged; they might be removed or changed at any time without prior notice.
  • Cumulocity DataHub proxy API: Cumulocity DataHub provides an API which proxies requests to the Dremio API. The Cumulocity user needs the role DataHub Reader in order to execute queries using the proxy API. The authentication against Dremio is done behind the scenes; a minimal sketch of this approach follows this list.
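
As an illustration of the proxy approach, the following sketch submits a query with a Cumulocity user holding the DataHub Reader role. The endpoint path /service/datahub/sql, the payload shape, and the table name used in the query are assumptions; check the API reference of your installation.

```python
# Minimal sketch: querying the data lake through the Cumulocity DataHub
# proxy API. The endpoint path, payload shape, and table name below are
# assumptions; authentication uses the Cumulocity user, not a Dremio user.
import requests

C8Y_BASE = "https://mytenant.cumulocity.com"  # hypothetical tenant URL
USER = "t12345/jane"                          # Cumulocity user with the DataHub Reader role
PASSWORD = "..."                              # elided

response = requests.post(
    f"{C8Y_BASE}/service/datahub/sql",
    json={"sql": "SELECT * FROM t12345DataLake.t12345.alarms LIMIT 10"},
    auth=(USER, PASSWORD),
)
response.raise_for_status()
print(response.json())
```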

The permissions for the role DataHub Reader are defined as follows:

Type                     READ   ADMIN
DataHub administration   no     no
DataHub management       no     no
DataHub query            yes    no

Assignment of Cumulocity DataHub roles and permissions

The roles DataHub Administrator, DataHub Manager, and DataHub Reader must be assigned to the respective users of your tenant. For assigning roles to users see Managing permissions. You need at least one user with the DataHub Administrator role to complete the Cumulocity DataHub configuration.

Info
You do not necessarily need to use the predefined roles to enable Cumulocity users to work with Cumulocity DataHub. Alternatively, you can modify other roles the users are associated with and add the corresponding permissions to those roles. In that case you must also add the DataHub application to the user’s applications.

Setting up the initial configuration

The setup of Cumulocity DataHub requires you to configure a Dremio API user and access to a data lake. In the navigator, select Initial configuration under Settings to define those settings.

Requirements
You need administration permissions to define the settings. See Defining Cumulocity DataHub permissions and roles for details.

Defining the initial configuration

Dremio API user

In order to access the data lake contents, you can use ODBC, JDBC, the Dremio REST API, or a proxy REST API. See Querying offloaded Cumulocity data for more details. The proxy REST API is served by the Cumulocity DataHub server, which acts as a proxy to Dremio. The proxy API requires a Dremio user for the interaction between the Cumulocity DataHub server and Dremio. This Dremio API user can then also be used for data lake querying based on JDBC, ODBC, or the Dremio REST API.

Therefore, in the initial configuration under Dremio API user, you must configure the name and the password of that Dremio API user.

The name is composed of two parts, with the first part being fixed:

  1. Tenant ID plus a forward slash
  2. A string with a minimum length of three, starting with a letter and consisting of letters, numbers, dashes, or underscores

The password of the Dremio API user must have at least eight characters, including at least one letter and one number.
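
As a minimal sketch of these rules, the following hypothetical validator encodes them as regular expressions; this is our reading of the rules, not an official Cumulocity DataHub check.

```python
# Hypothetical validator for the Dremio API user naming rules described
# above; the regular expressions are our interpretation of those rules.
import re

NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_-]{2,}$")           # part after "tenantId/"
PASSWORD_RE = re.compile(r"^(?=.*[A-Za-z])(?=.*[0-9]).{8,}$")  # >= 8 chars, letter + digit

def full_username(tenant_id: str, name: str) -> str:
    # The full username is the tenant ID, a forward slash, and the chosen name.
    return f"{tenant_id}/{name}"

assert NAME_RE.match("myUser")
assert not NAME_RE.match("1user")           # must start with a letter
assert PASSWORD_RE.match("secret123")
assert not PASSWORD_RE.match("password")    # no number
print(full_username("t47110815", "myUser"))  # t47110815/myUser
```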

Info
When using the proxy REST API, all queries are processed using the same Dremio API user. The queries are listed in the query log. Thus, the log shows all queries of all users who have used the proxy API.

Your applications might require more than one Dremio user for accessing the data lake. You can define additional Dremio users for that purpose, using the instructions in Adding a Dremio user.

Data Lake

Depending on the configuration of the environment, the data lake provider is either fixed or you can choose among different providers. For each data lake provider, you must specify corresponding settings to define the data lake to be used.

Requirements
The setting Partition Column Inference must not be enabled, as it makes Dremio assume a specific folder structure, which conflicts with the folder structure used by Cumulocity DataHub.

The following types of data lakes are currently supported:

Azure Storage

Azure Storage is a set of cloud storage services offered by Microsoft. Cumulocity DataHub supports Azure Data Lake Storage Gen2, which is part of these services. The following settings must be defined for this data lake:

  • Azure Storage account name: The name of the Azure storage account.
  • Azure Storage container: The name of the storage container; it must be between 1 and 63 characters long and may contain alphanumeric characters (letters and numbers) as well as dashes (-).
  • Root path: The root path within your data lake for storing the offloaded data. With the default path /, data is stored top-level in your storage container. You can also store data in a subfolder, provided the folder already exists. For example, for storage container myContainer and subfolder mySubFolder, use /myContainer/mySubFolder as root path. This option is especially useful to hide other data inside the container from Cumulocity DataHub, for example, when the container is also used by other users or applications.
  • Azure Storage shared access key: The access key used for authentication if “Shared Access Key” is used as authentication type.
  • Application ID: The application ID used for authentication if “Azure Active Directory” is used as authentication type.
  • OAuth 2.0 Token Endpoint: The authentication endpoint if “Azure Active Directory” is used as authentication type.
  • Client Secret: The client secret if “Azure Active Directory” is used as authentication type.

While the other settings are fixed once the initial configuration has been saved, the authentication type as well as the values of the selected authentication type can be changed afterwards. Click Edit, set new values, and either click Save credentials to save the update or Cancel to keep the old values.

Requirements
Note that the account type must be StorageV2, and the Hierarchical namespace feature must be activated for the corresponding Azure Storage account. For performance reasons, it is recommended to set the Blob access tier to Hot. Also note that if IP white-listing is activated, Cumulocity DataHub might not be able to access the data lake if the data lake and Cumulocity DataHub reside in the same Azure region. See also the corresponding documentation.

Amazon S3

Amazon S3 is an object storage service offered by Amazon Web Services. The following settings must be defined for this data lake:

  • AWS access key: The access key.
  • Access secret: The access secret.
  • Bucket name: The name of the S3 bucket; it must be between 1 and 63 characters long and may contain alphanumeric characters (letters and numbers) as well as dashes (-).
  • Root path in bucket: The root path within the S3 bucket; the default root path is /; setting a subfolder allows you to hide other data in the bucket from Cumulocity DataHub.

While the other settings are fixed once the initial configuration has been saved, the AWS access key and the Access secret can be changed afterwards. Click Edit, set new values, and either click Save credentials to save the update or Cancel to keep the old values.

Requirements
An S3 bucket with default settings works. If specific security policies are applied, make sure that the minimum policy requirements listed in https://docs.dremio.com/current/sonar/data-sources/object/s3 are satisfied.

Server-side encryption is supported while client-side encryption is not. S3 offers three key management mechanisms:

SSE-S3: An AES256 key is generated in S3 and saved alongside the data. Enabling SSE-S3 requires adding the following key-value pair in the Additional Properties section:
Name: fs.s3a.server-side-encryption-algorithm
Value: AES256

SSE-KMS: An AES256 key is generated in S3 and encrypted with a secret key provided by Amazon’s Key Management Service (KMS). The key must be referenced by name by Cumulocity DataHub. Enabling SSE-KMS requires adding the following key-value pairs in the Additional Properties section:
Name: fs.s3a.server-side-encryption-algorithm
Value: SSE-KMS

Name: fs.s3a.server-side-encryption.key
Value: Your key name, for example, arn:aws:kms:eu-west-2:123456789012:key/071a86ff-8881-4ba0-9230-95af6d01ca01

SSE-C: The client specifies a base64-encoded AES-256 key to be used to encrypt and decrypt the data. Cumulocity DataHub does not support this option.

NAS

NAS is a storage system mounted directly into the Dremio cluster via NFS or SMB. It is only available for Cumulocity Edge installations. The following settings must be defined for this data lake:

  • Mount path: The mount path refers to a path in the local Linux file system on both the coordinator and executor containers. By default, the file system of Cumulocity Edge is mounted into /datalake inside the containers. To use some other folder, you must map the folder into both containers, for example, to /datalake inside the containers.

HDFS

HDFS is the Hadoop Distributed File System, which is a distributed, scalable file system designed for running on commodity hardware. The following settings must be defined for this data lake:

  • Namenode host: The host name of the HDFS NameNode.
  • Namenode port: The port of the HDFS NameNode.
  • Root path: The root path within the HDFS filesystem for storing offloaded data; the default root path is /; setting a subfolder allows you to hide other data in the filesystem from Cumulocity DataHub.
  • Short-circuit local reads: If enabled, Dremio can directly open the HDFS block files; disabled by default.
  • Enable impersonation: If disabled, all requests against HDFS are made using the user dremio; if enabled, the tenant name is used to access HDFS, provided that user has rwx-permissions for the given root path. Note that the user dremio is used for some operations even when impersonation is enabled, so it must have appropriate permissions in any case.
  • Allow VDS-based access delegation: If enabled, data used in virtual datasets (VDS) is requested from HDFS using the username of the owner of the VDS; if disabled, the name of the user logged into Dremio is used.
  • Impersonation user delegation: Defines whether an impersonated username is used As is, Lowercase, or Uppercase.

Info
Impersonation is supported and may be used. However, when impersonation is enabled, Dremio uses the tenant ID as the username for querying HDFS, not the actual username. For example, if “t12345/user” is the logged-in user, Dremio will use “t12345” for HDFS requests. Thus, granting file system permissions is only possible on a per-tenant basis, not on a per-user basis. Also note that the user dremio is used for some operations even when impersonation is enabled, so it must have appropriate permissions in any case.

For Azure Storage, Amazon S3, and HDFS data lakes, you can also define additional connection properties. Click Add property and define an additional property consisting of a key/value pair.

Saving settings

Once all settings are defined, click Save in the action bar to the right. During the save process, the following steps are automatically conducted:

  • A Dremio API user is created; the user has standard Dremio user privileges, not admin privileges.
  • A data lake connection in Dremio is created using the provided data lake settings. In Dremio terms, that connection is a source. In our context we refer to it as the target table, as this data lake is used for storing the offloaded data.
  • A source in Dremio is created which connects to the Operational Store of Cumulocity. That source is not visible to the Dremio API user.
  • A space in Dremio is created which you can use to organize your custom Dremio entities such as views. The name of the space is your tenant ID concatenated with ‘Space’, for example, t12345Space.

Editing settings

To edit the Dremio API user, click Edit in the Dremio API user section of the Initial configuration page. In the editor you can edit all user details, except for the username, which is fixed. All user details are described in Editing a Dremio user.

The data lake settings cannot be edited, except for the Azure Storage or Amazon S3 credentials. For editing other values, you must delete the existing settings and define new settings. If you want to keep your offloading configurations, you must export the configurations to a backup file beforehand, delete the settings, define new settings, and import the configurations from the backup file. See Importing/exporting offloading configurations for details on import/export.

Deleting settings

Click Delete in the action bar to delete the settings. During deletion, all Dremio artifacts which were created when saving the settings are deleted, including the Dremio API user as well as additionally created Dremio users. Also the artifacts created by a corresponding Dremio user, like views, are deleted. All offloading pipelines and their histories are deleted; active pipelines are deleted after completing the current offloading. As mentioned in the previous section, you can use the import/export functionality to back up your offloading configurations. The data lake and its contents are not deleted, only the Dremio artifacts connecting to the data lake. To delete the data lake and its contents, you must use the tooling of your data lake provider.

Setting up Dremio users

In the initial configuration of Cumulocity DataHub, the Dremio API user is configured. This user is required for the proxy REST API, which allows you to interact with Dremio using Cumulocity DataHub. This user can also be used to directly interact with Dremio in applications, using JDBC, ODBC, or REST API.

Some use cases might require more than one Dremio user for the interaction with Dremio. For that purpose, additional Dremio users can be added.

Requirements
You need administration permissions to configure Dremio users. See Defining Cumulocity DataHub permissions and roles for details.

Overview of Dremio users

In the navigator, select Dremio users under Settings to get an overview of all Dremio users created by an administrator of your Cumulocity DataHub tenant.

The list of Dremio users with their corresponding properties is displayed. The context menu of each user provides actions to edit or delete a user.

If the initial configuration has not been completed yet, no users are shown. If the initial configuration has been completed, the list includes the Dremio API user configured in the initial configuration.

Properties of a Dremio user

Username

The username is a mandatory setting. It must be a unique value, that is, no other Dremio user has the same username. It consists of the tenant ID plus a forward slash and a string with a minimum length of three, starting with a letter and consisting of letters, numbers, dashes, or underscores. For example, the username may be t47110815/myUser.

First name, last name, and email

The first name, last name, and email of a Dremio user are optional settings.

Permissions for data lake and space

During the initial configuration of Cumulocity DataHub, a so-called source in Dremio is created, which connects Dremio with the data lake. Additionally, a so-called space is created in Dremio, in which Dremio artifacts like views can be organized.

The Dremio user can be assigned additional permissions for the data lake source and the space. If the user has the permission for the data lake source assigned, the user can manage grants on that source for other users as well. The same applies to the space permission. Data lake permission and space permission are independent of each other; the setting of one permission does not affect the setting of the other.

With the corresponding permission assigned, the user can grant other Dremio users, who do not necessarily relate to Cumulocity DataHub, different permissions on the data lake source or the space, for example, for reading data from the data lake or creating a table in the data lake.

Caution
Granting permissions to other users should be done very carefully to avoid exposing sensitive information to the wrong users. In particular, permissions should never be granted to all users, as in that case every Dremio user of the Cumulocity instance can access the data lake source or space, respectively.

For example, IoT data has been offloaded to the data lake using Cumulocity DataHub. A data scientist from a different business unit now wants to access the data lake contents. A Dremio account must be created for the data scientist. Then a Dremio user created by Cumulocity DataHub, who holds the data lake permission, grants read access on the data lake source to the Dremio account of the data scientist.

Info
Dremio refers to the permissions as privileges. Privileges include for example SELECT, ALTER, CREATE TABLE, or DROP. A Dremio user with the corresponding permissions can grant permissions to other users via the Dremio UI. In the UI, browse to the data lake or space and select Edit details in the context menu. In the editor, the list of privileges for all users is shown, with the option to update privileges and users. Alternatively, you can use the Dremio SQL API to modify privileges.
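
As a sketch of the SQL alternative, the following snippet submits a GRANT statement through the Dremio REST API. The endpoint paths, the _dremio token scheme, and the exact GRANT syntax and entity names are assumptions based on Dremio's documented API; verify them against the Dremio documentation for your version.

```python
# Minimal sketch: granting SELECT on the data lake source to another Dremio
# user via the Dremio SQL API. Endpoint paths, token scheme, GRANT syntax,
# and all names below are assumptions; adjust them to your installation.
import requests

DREMIO = "https://dremio.example.com"  # hypothetical Dremio URL
ADMIN_USER = "t12345/dh_admin"         # Dremio user holding the manage-grants permission
ADMIN_PASSWORD = "..."                 # elided

# Log in to obtain an API token.
login = requests.post(
    f"{DREMIO}/apiv2/login",
    json={"userName": ADMIN_USER, "password": ADMIN_PASSWORD},
)
login.raise_for_status()
token = login.json()["token"]

# Submit the GRANT statement; the source and grantee names are hypothetical.
grant = requests.post(
    f"{DREMIO}/api/v3/sql",
    headers={"Authorization": f"_dremio{token}"},
    json={"sql": 'GRANT SELECT ON SOURCE "t12345DataLake" TO USER "t12345/analyst"'},
)
grant.raise_for_status()
```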

For each user, including the Dremio API user, the manage grants permissions on data lake and space are initially not set.

Password

The password must have at least 8 characters with at least one letter and one number.

Adding a Dremio user

To add a Dremio user, select Dremio users under Settings and click Add user at the right of the top menu bar. In the editor, provide the corresponding Dremio user properties.

Click Save to save the settings and create the new user. Click Cancel to cancel the creation of the user.

Editing a Dremio user

The Dremio user section under Settings displays the list of users. For each user, there is a context menu on the right side. Select Edit from that menu to edit a user. Except for the username, all settings can be changed. The password can also optionally be changed by clicking Change password. Click Save to apply the new settings.

Deleting a Dremio user

In the context menu of the Dremio user list, select Delete and click Confirm in the subsequent confirmation dialog to delete a Dremio user. The Dremio API user defined in the initial configuration cannot be deleted that way. This user can only be deleted if the settings under Initial configuration are deleted. In the latter case, all Dremio users associated with this Cumulocity DataHub instance are deleted.