Setting up Cumulocity DataHub
This section describes how to set up Cumulocity DataHub.
Before setting up Cumulocity DataHub, the following prerequisites must be checked:
The Cumulocity DataHub microservice and web application must be available as applications on your tenant. The web application provides the user interface to configure Cumulocity DataHub and to manage offloading pipelines, while the microservice provides the corresponding backend functionality. The web application is named DataHub, whereas the microservice is named Datahub. Both applications are deployed either as:
If you have an Enterprise tenant, you can also subscribe your subtenants to both applications so that the subtenants can use Cumulocity DataHub as well.
See Managing applications for details on managing Cumulocity applications in general, including instructions for adding applications to a tenant.
See Managing microservices and Monitoring microservices for details on Cumulocity microservices, including instructions for:
See Managing tenants for details on subscribing applications or microservices to a tenant or subtenant.
For the offloading of Cumulocity data, you need the connection settings and credentials for a cloud data lake service. During offloading, the data will be written into a data lake folder named after the tenant.
Dedicated permissions define what a user is allowed to do in Cumulocity DataHub. To ease assigning permissions to users, permissions are grouped in roles. During deployment of the Cumulocity DataHub applications the corresponding permissions as well as roles are created. If a role with the same name already exists, no new role will be created. The same holds for permissions.
If you do not have corresponding Cumulocity DataHub permissions, you will get a warning after login.
The administrator primarily sets up the data lake and Dremio account and conducts administrative tasks like inspecting audit logs or monitoring the system status. The administrator can also manage offloading pipelines, for example, defining and starting a pipeline.
For those tasks the default role DataHub Administrator is created. The permissions for this role are defined as follows:
Type | READ | ADMIN |
---|---|---|
DataHub administration | yes | yes |
DataHub management | yes | yes |
DataHub query | yes | no |
While READ refers to reading the specific data, ADMIN refers to creating, updating, or deleting the specified data.
The manager manages offloading pipelines such as defining and starting a pipeline. For those tasks the default role DataHub Manager is created. The permissions for this role are defined as follows:
Type | READ | ADMIN |
---|---|---|
DataHub administration | no | no |
DataHub management | yes | yes |
DataHub query | yes | no |
The user executes SQL queries against the data in the data lake. For details on querying the data lake see Querying offloaded Cumulocity data. To execute queries the following approaches can be used:
The permissions for the role DataHub Reader are defined as follows:
Type | READ | ADMIN |
---|---|---|
DataHub administration | no | no |
DataHub management | no | no |
DataHub query | yes | no |
The roles DataHub Administrator, DataHub Manager, and DataHub Reader must be assigned to the respective users of your tenant. For assigning roles to users see Managing permissions. You need at least one user with the DataHub Administrator role to complete the Cumulocity DataHub configuration.
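The three role tables above can be condensed into a single lookup structure, which is handy when checking which role a user needs for a task. The following is a minimal sketch only: the dictionary keys (`administration`, `management`, `query`) and the helper names `is_allowed` and `roles_with` are hypothetical and chosen for illustration; they are not identifiers used by Cumulocity DataHub.

```python
from typing import Dict, List, Set

# Role/permission matrix transcribed from the role tables in this section.
# The short area keys are illustrative names, not DataHub identifiers.
ROLE_PERMISSIONS: Dict[str, Dict[str, Set[str]]] = {
    "DataHub Administrator": {
        "administration": {"READ", "ADMIN"},
        "management": {"READ", "ADMIN"},
        "query": {"READ"},
    },
    "DataHub Manager": {
        "administration": set(),
        "management": {"READ", "ADMIN"},
        "query": {"READ"},
    },
    "DataHub Reader": {
        "administration": set(),
        "management": set(),
        "query": {"READ"},
    },
}


def is_allowed(role: str, area: str, level: str) -> bool:
    """Return True if the role holds the level (READ or ADMIN) for the area."""
    return level in ROLE_PERMISSIONS.get(role, {}).get(area, set())


def roles_with(area: str, level: str) -> List[str]:
    """List all default roles that hold the level for the area."""
    return [
        role
        for role, areas in ROLE_PERMISSIONS.items()
        if level in areas.get(area, set())
    ]
```

For example, `roles_with("management", "ADMIN")` lists the roles that may define and start offloading pipelines according to the tables.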
The setup of Cumulocity DataHub requires you to configure a Dremio API user and access to a data lake. In the navigator, select Initial configuration under Settings to define those settings.
In order to access the data lake contents, you can use ODBC, JDBC, Dremio REST API, or a proxy REST API. See Querying offloaded Cumulocity data for more details. The proxy REST API is served by the Cumulocity DataHub server, which acts as a proxy to Dremio. The proxy API requires a Dremio user for the interaction of Cumulocity DataHub server and Dremio. This Dremio API user can then also be used for data lake querying based on JDBC, ODBC, or Dremio REST API.
Therefore, in the initial configuration you must specify the name and the password of that user under Dremio API user.
The name is composed of two parts, with the first part being fixed:
The password of the Dremio API user must have at least eight characters, including at least one letter and one number.
Your follow-up application might require more than one Dremio user for accessing the data lake. You can define additional Dremio users for that purpose, using the instructions in Adding a Dremio user.
Depending on the configuration of the environment, the data lake provider is either fixed or you can choose among different providers. For each data lake provider, you must specify corresponding settings to define the data lake to be used.
The following types of data lakes are currently supported:
Azure Storage is a set of cloud storage services offered by Microsoft. Cumulocity DataHub supports Azure Data Lake Storage Gen2, which is part of these services. The following settings must be defined for this data lake:
Settings | Description |
---|---|
Azure Storage account name | The name of the Azure storage account |
Azure Storage container | The name of the storage container; it must be between 1 and 63 characters long and may contain alphanumeric characters (letters and numbers) as well as dashes (-) |
Root path | The root path within your data lake for storing the offloaded data. With the default path /, data is stored top-level in your storage container. You can also store data in a subfolder, provided the folder already exists. For example, for storage container myContainer and subfolder mySubFolder , use /myContainer/mySubFolder as root path. This option is especially useful to hide other data inside the container from Cumulocity DataHub, for example, when the container is also used by other users or applications. |
Azure Storage shared access key | The access key used for authentication if “Shared Access Key” is used as authentication type |
Application ID | The application ID used for authentication if “Azure Active Directory” is used as authentication type |
OAuth 2.0 Token Endpoint | The authentication endpoint if “Azure Active Directory” is used as authentication type |
Client Secret | The client secret if “Azure Active Directory” is used as authentication type |
While the other settings are fixed once the initial configuration has been saved, the authentication type and the values of the selected authentication type can be changed afterwards. Click Edit, set new values, and either click Save credentials to save the update or Cancel to keep the old values.
Amazon S3 is an object storage service offered by Amazon Web Services. The following settings must be defined for this data lake:
Settings | Description |
---|---|
AWS access key | The access key |
Access secret | The access secret |
Bucket name | The name of the S3 bucket; it must be between 1 and 63 characters long and may contain alphanumeric characters (letters and numbers) as well as dashes (-) |
Root path in bucket | The root path within the S3 bucket; default root path is /; setting a subfolder allows you to hide other data in the bucket from Cumulocity DataHub |
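The naming rule stated for the S3 bucket name (and, identically, for the Azure Storage container name) can be checked before entering the value. The sketch below encodes only the rule given in this section: 1 to 63 characters, letters, numbers, and dashes. It assumes ASCII letters and digits, and the function name `is_valid_storage_name` is hypothetical. The storage provider may impose additional naming rules that are not covered here.

```python
import re

# 1 to 63 characters; ASCII letters, digits, and dashes only (assumption:
# "alphanumeric" in the settings table means ASCII letters and digits).
_NAME_PATTERN = re.compile(r"[A-Za-z0-9-]{1,63}")


def is_valid_storage_name(name: str) -> bool:
    """Check a bucket or container name against the rule in the settings table."""
    return _NAME_PATTERN.fullmatch(name) is not None
```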
While the other settings are fixed once the initial configuration has been saved, the AWS access key and the Access secret can be changed afterwards. Click Edit, set new values, and either click Save credentials to save the update or Cancel to keep the old values.
Server-side encryption is supported while client-side encryption is not. S3 offers three key management mechanisms:
SSE-S3: An AES256 key is generated in S3 and saved alongside the data. To enable SSE-S3, add the following key-value pair to the Additional Properties section:
Name: fs.s3a.server-side-encryption-algorithm
Value: AES256
SSE-KMS: An AES256 key is generated in S3 and encrypted with a secret key provided by Amazon’s Key Management Service (KMS). Cumulocity DataHub must reference the key by name. To enable SSE-KMS, add the following key-value pairs to the Additional Properties section:
Name: fs.s3a.server-side-encryption-algorithm
Value: SSE-KMS
Name: fs.s3a.server-side-encryption.key
Value: Your key name, for example, arn:aws:kms:eu-west-2:123456789012:key/071a86ff-8881-4ba0-9230-95af6d01ca01
SSE-C: The client specifies a base64-encoded AES-256 key to be used to encrypt and decrypt the data. Cumulocity DataHub does not support this option.
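The key-value pairs listed for the two supported mechanisms can be summarized as follows. This is an illustrative sketch for preparing the values to enter as additional properties; the function name `sse_properties` is hypothetical and the code does not call any Cumulocity DataHub or AWS API. The property names and values are the ones given above.

```python
from typing import Dict, Optional

ALGORITHM_KEY = "fs.s3a.server-side-encryption-algorithm"
KMS_KEY_NAME_KEY = "fs.s3a.server-side-encryption.key"


def sse_properties(mode: str, kms_key: Optional[str] = None) -> Dict[str, str]:
    """Return the additional properties for the given encryption mode."""
    if mode == "SSE-S3":
        return {ALGORITHM_KEY: "AES256"}
    if mode == "SSE-KMS":
        if not kms_key:
            raise ValueError("SSE-KMS requires the name of the KMS key")
        return {ALGORITHM_KEY: "SSE-KMS", KMS_KEY_NAME_KEY: kms_key}
    # SSE-C (client-provided keys) is not supported by Cumulocity DataHub.
    raise ValueError("Unsupported encryption mode: " + mode)
```

Calling `sse_properties("SSE-KMS", "arn:aws:kms:eu-west-2:123456789012:key/071a86ff-8881-4ba0-9230-95af6d01ca01")` yields the two pairs shown for SSE-KMS.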
NAS is a storage system mounted (NFS, SMB) directly into the Dremio cluster. It is only available for Cumulocity Edge installations. The following settings must be defined for this data lake:
Settings | Description |
---|---|
Mount path | The mount path refers to a path in the local Linux file system on both the coordinator and executor containers. By default, the file system of Cumulocity Edge is mounted into /datalake inside the containers. To use some other folder, you must map the folder into both containers, for example, to /datalake inside the containers. |
HDFS is the Hadoop Distributed File System, which is a distributed, scalable file system designed for running on commodity hardware. The following settings must be defined for this data lake:
Settings | Description |
---|---|
Namenode host | The host name of the HDFS NameNode |
Namenode port | The port of the HDFS NameNode |
Root path | The root path within the HDFS filesystem for storing offloaded data; default root path is /; setting a subfolder allows you to hide other data in the filesystem from Cumulocity DataHub |
Short-circuit local reads | If enabled, Dremio can directly open the HDFS block files; default is disabled |
Enable impersonation | If disabled, all requests against HDFS will be made using the user dremio; if enabled, the tenant name will be used to access HDFS; prerequisite is that the user has rwx-permissions for the given root path. Note that the user dremio is used for some operations even when impersonation is enabled. Thus, it must have appropriate permissions in any case. |
Allow VDS-based access delegation | If enabled, data used in virtual datasets (VDS) will be requested from HDFS using the username of the owner of the VDS; if disabled, the name of the user logged into Dremio is used |
Impersonation user delegation | Defines whether an impersonated username is used As is or converted to Lowercase or Uppercase |
For Azure Storage, Amazon S3, and HDFS data lakes, you can also define additional connection properties. Click Add property and define an additional property consisting of a key/value pair.
Once all settings are defined, click Save in the action bar to the right. During the save process, the following steps are automatically conducted:
To edit the Dremio API user, click Edit in the Dremio API user section of the Initial configuration page. In the editor you can edit all user details, except for the username, which is fixed. In Editing a Dremio user, all user details are described.
The data lake settings cannot be edited, except for the Azure Storage or Amazon S3 credentials. For editing other values, you must delete the existing settings and define new settings. If you want to keep your offloading configurations, you must export the configurations to a backup file beforehand, delete the settings, define new settings, and import the configurations from the backup file. See Importing/exporting offloading configurations for details on import/export.
Click Delete in the action bar to delete the settings. During deletion, all Dremio artifacts which were created when saving the settings are deleted, including the Dremio API user as well as additionally created Dremio users. The artifacts created by a corresponding Dremio user, like views, are deleted as well. All offloading pipelines and their histories are deleted; active pipelines are deleted after completing the current offloading. As mentioned in the previous section, you can use the import/export functionality to back up your offloading configurations. The data lake and its contents are not deleted, only the Dremio artifacts connecting to the data lake. To delete the data lake and its contents you must use the tooling of your data lake provider.
In the initial configuration of Cumulocity DataHub, the Dremio API user is configured. This user is required for the proxy REST API, which allows you to interact with Dremio using Cumulocity DataHub. This user can also be used to directly interact with Dremio in applications, using JDBC, ODBC, or REST API.
Some use cases might require more than one Dremio user for the interaction with Dremio. For that purpose, additional Dremio users can be added.
In the navigator, select Dremio users under Settings to get an overview of all Dremio users created by an administrator of your Cumulocity DataHub tenant.
The list of Dremio users with their corresponding properties is displayed. The context menu of each user provides actions to edit or delete a user.
If the initial configuration has not been completed yet, no users are shown. If the initial configuration has been completed, the list includes the Dremio API user configured in the initial configuration.
The username is a mandatory setting. It must be unique, that is, no other Dremio user may have the same username. It consists of the tenant ID, a forward slash, and a string of at least three characters that starts with a letter and consists of letters, numbers, dashes, or underscores. For example, the username may be t47110815/myUser.
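The username rule can be expressed as a small check. This is a sketch under assumptions: it treats letters and numbers as ASCII, takes the tenant ID as an input rather than validating its format, and does not check uniqueness, which only Dremio can decide. The function name `is_valid_dremio_username` is hypothetical.

```python
import re

# Part after the slash: starts with a letter, at least three characters,
# consisting of letters, digits, dashes, or underscores (ASCII assumed).
_SUFFIX_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_-]{2,}")


def is_valid_dremio_username(username: str, tenant_id: str) -> bool:
    """Check that the username has the form <tenant ID>/<name>."""
    if not tenant_id:
        return False
    prefix = tenant_id + "/"
    if not username.startswith(prefix):
        return False
    return _SUFFIX_PATTERN.fullmatch(username[len(prefix):]) is not None
```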
The first name, last name, and email of a Dremio user are optional settings.
During the initial configuration of Cumulocity DataHub, a so-called source in Dremio is created, which connects Dremio with the data lake. Additionally, a so-called space is created in Dremio, in which Dremio artifacts like views can be organized.
The Dremio user can be assigned additional permissions for the data lake source and the space. If the user has the permission for the data lake source assigned, the user can manage grants on that source for other users as well. The same applies to the space permission. Data lake permission and space permission are independent of each other; the setting of one permission does not affect the setting of the other.
Having the corresponding permission assigned, the user can grant other Dremio users, who do not necessarily relate to Cumulocity DataHub, different permissions on the data lake source or the space, for example, for reading data from the data lake or creating a table in the data lake.
For example, IoT data has been offloaded to the data lake using Cumulocity DataHub. A data scientist from a different business unit now wants to access the data lake contents. A Dremio account must be created for the data scientist. Then, a Dremio user created by Cumulocity DataHub, having the data lake permission, grants read access on the data lake source to the Dremio account of the data scientist.
For each user, including the Dremio API user, the manage grants permissions on data lake and space are initially not set.
The password must have at least 8 characters with at least one letter and one number.
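The password rule can be checked up front with a few lines of code. The sketch below is illustrative only; the function name `is_valid_dremio_password` is hypothetical, and it relies on Python's `str.isalpha` and `str.isdigit`, which also accept non-ASCII letters and digits. Whether Dremio accepts such characters is not stated here.

```python
def is_valid_dremio_password(password: str) -> bool:
    """Check for at least 8 characters with at least one letter and one number."""
    has_letter = any(ch.isalpha() for ch in password)
    has_number = any(ch.isdigit() for ch in password)
    return len(password) >= 8 and has_letter and has_number
```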
To add a Dremio user, select Dremio users under Settings and click Add user at the right of the top menu bar. In the editor, provide the corresponding Dremio user properties.
Click Save to save the settings and create the new user. Click Cancel to cancel the creation of the user.
The Dremio users page under Settings displays the list of users. For each user, there is a context menu on the right side. Select Edit from that menu to edit a user. Except for the username, all settings can be changed. To change the password, click Change password. Click Save to apply the new settings.
In the context menu of the Dremio user list, select Delete and click Confirm in the subsequent confirmation dialog to delete a Dremio user. The Dremio API user defined in the initial configuration cannot be deleted that way. This user can only be deleted if the settings under Initial configuration are deleted. In the latter case, all Dremio users associated with this Cumulocity DataHub instance are deleted.