Setting up DataHub
This section describes how to set up Cumulocity IoT DataHub.
Before setting up DataHub, the following prerequisites must be met:
You need the connection settings and credentials for a cloud data lake service. During offloading, the data is written into a folder in this data lake named after the tenant.
Info: Instructions on how to configure the data lake so that it is accessible via Dremio are available in the Dremio data source documentation. Note that you must not create the data lake source artifact in Dremio yourself; DataHub creates it for you.
The DataHub microservice and web application must be available as applications on your tenant. The web application provides the user interface to configure DataHub and to manage offloading pipelines, while the microservice provides the actual backend implementation for this functionality. The web application and the microservice are both named DataHub. Both applications are deployed either as:
If you have an enterprise tenant, you can also subscribe your sub-tenants to both applications so that the sub-tenants can use DataHub as well.
See section Managing applications for details on managing applications in general, including instructions for:
Dedicated permissions define what a user is allowed to do in DataHub. To ease assigning permissions to users, permissions are grouped into roles. During deployment of the DataHub applications, the corresponding permissions and roles are created. If a role or permission with the same name already exists, it is not created again.
The administrator primarily sets up the data lake and Dremio account and conducts administrative tasks like viewing audit logs or monitoring the system status. The administrator can also manage offloading pipelines, e.g., defining and starting a pipeline.
For those tasks the default role DATAHUB_ADMINISTRATOR is created. The permissions for this role are defined as follows:
Type | READ | ADMIN |
---|---|---|
Cdh configure | yes | yes |
Cdh manage | yes | yes |
Cdh use | yes | no |
The configurator manages offloading pipelines, e.g., defining and starting a pipeline. For those tasks the default role DATAHUB_MANAGER is created. The permissions for this role are defined as follows:
Type | READ | ADMIN |
---|---|---|
Cdh configure | yes | yes |
Cdh manage | no | no |
Cdh use | yes | no |
The user runs queries against the data in the data lake. For details see section Querying offloaded Cumulocity IoT data. To run queries the following approaches can be used:
The permissions for the role DATAHUB_READER are defined as follows:
Type | READ | ADMIN |
---|---|---|
Cdh configure | no | no |
Cdh manage | no | no |
Cdh use | yes | no |
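The three permission matrices above can be restated compactly in code. The sketch below only transcribes the documented tables; the dictionary structure is illustrative and not part of any DataHub API.

```python
# Illustrative summary of the default DataHub roles and their permissions,
# transcribed from the tables above.
DEFAULT_ROLES = {
    "DATAHUB_ADMINISTRATOR": {
        "Cdh configure": {"READ": True, "ADMIN": True},
        "Cdh manage":    {"READ": True, "ADMIN": True},
        "Cdh use":       {"READ": True, "ADMIN": False},
    },
    "DATAHUB_MANAGER": {
        "Cdh configure": {"READ": True, "ADMIN": True},
        "Cdh manage":    {"READ": False, "ADMIN": False},
        "Cdh use":       {"READ": True, "ADMIN": False},
    },
    "DATAHUB_READER": {
        "Cdh configure": {"READ": False, "ADMIN": False},
        "Cdh manage":    {"READ": False, "ADMIN": False},
        "Cdh use":       {"READ": True, "ADMIN": False},
    },
}

def can(role: str, permission: str, level: str) -> bool:
    """Check whether a default role grants READ or ADMIN on a permission type."""
    return DEFAULT_ROLES[role][permission][level]
```

For example, `can("DATAHUB_READER", "Cdh use", "READ")` is true, while only the administrator role has ADMIN on "Cdh manage".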
The roles DATAHUB_ADMINISTRATOR, DATAHUB_MANAGER, and DATAHUB_READER have to be assigned to the respective users of your tenant. For assigning roles to users see section Managing permissions. You need at least one user with the DATAHUB_ADMINISTRATOR role to complete the DataHub configuration.
Info: You do not necessarily need to use the predefined roles to enable Cumulocity IoT users to work with DataHub. Alternatively, you can modify other roles the users are associated with and add the corresponding permissions to those roles. In that case you also have to add the DataHub application to the user’s applications.
The setup of DataHub requires the administrator to choose a Dremio account name and to provide credentials for the data lake. In the navigator, select Settings to define those settings.
Under Dremio account, the name and password of the Dremio account are defined.
The name is composed of three parts:
If your tenant ID is t12345, then t12345/user is a valid name. The system also sets this value as the initial value in the account field.
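Following the example above, an account name can be assembled from the tenant ID and a user-chosen part; the tenantId/username pattern here is inferred from that example and is illustrative only.

```python
# Illustrative only: builds a Dremio account name following the
# tenantId/username pattern from the example above (t12345/user).
def dremio_account_name(tenant_id: str, username: str) -> str:
    return f"{tenant_id}/{username}"
```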
The password of the Dremio account must be at least eight characters long and include at least one letter and one number.
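A minimal pre-check of the stated password rule (at least eight characters, including at least one letter and one number) might look as follows; this is a sketch of the documented rule, not code from DataHub itself.

```python
import re

def is_valid_dremio_password(password: str) -> bool:
    """Check the documented rule: at least eight characters,
    including at least one letter and one number."""
    return (
        len(password) >= 8
        and re.search(r"[A-Za-z]", password) is not None
        and re.search(r"[0-9]", password) is not None
    )
```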
The type of data lake to be used is preconfigured for the DataHub microservice; the type cannot be changed afterwards. Depending on the data lake type, you have to specify different settings.
The following types of data lakes are currently supported:
Azure Data Lake Storage Gen1 is a repository for big data analytic workloads offered by Microsoft. The following settings need to be defined for this data lake:
Settings | Description |
---|---|
Data Lake Store resource name | The name of the instance created in Azure Data Lake |
Application ID | The ID of the registered application under Azure Active Directory |
OAuth 2.0 token endpoint | The OAuth 2.0 authentication endpoint for registered applications |
Root path | The root path in the data lake under which the offloaded data will be stored |
Access key value | The password for the registered application |
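For orientation, the OAuth 2.0 token endpoint of an Azure AD tenant typically follows the pattern shown below (Azure AD v1 endpoint); verify the exact URL in the Azure portal under your app registration's endpoints, as this helper only illustrates the pattern.

```python
# Sketch: the OAuth 2.0 token endpoint for an Azure AD directory typically
# has the form https://login.microsoftonline.com/<directory-tenant-id>/oauth2/token.
# Always confirm the exact endpoint in the Azure portal before using it.
def aad_token_endpoint(directory_tenant_id: str) -> str:
    return f"https://login.microsoftonline.com/{directory_tenant_id}/oauth2/token"
```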
Azure Storage is a set of cloud storage services offered by Microsoft. DataHub supports Azure Data Lake Storage Gen2, which is part of these services. The following settings need to be defined for this data lake:
Settings | Description |
---|---|
Azure Storage account name | The name of the Azure storage account |
Azure Storage container | The name of the storage container; it must be between 1 and 63 characters long and may contain alphanumeric characters (letters and numbers) as well as dashes (-) |
Root path | The root path in the data lake under which the offloaded data will be stored |
Azure Storage shared access key | The access key used for authentication |
Amazon S3 is an object storage service offered by Amazon Web Services. The following settings need to be defined for this data lake:
Settings | Description |
---|---|
AWS access key | The access key |
Access secret | The access secret |
Bucket name | The name of the S3 bucket; it must be between 1 and 63 characters long and may contain alphanumeric characters (letters and numbers) as well as dashes (-) |
Root path in bucket | The root path within the S3 bucket |
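Both the Azure Storage container name and the S3 bucket name share the documented naming rule (1 to 63 characters, alphanumeric characters and dashes). A minimal pre-check of that rule can be sketched as follows; note that the cloud providers enforce additional restrictions (for example lowercase letters or no leading/trailing dash), so a name passing this check may still be rejected by the provider.

```python
import re

# Pre-check of the documented naming rule shared by Azure Storage containers
# and S3 buckets: 1-63 characters, alphanumeric characters and dashes only.
_NAME_RE = re.compile(r"^[A-Za-z0-9-]{1,63}$")

def is_valid_lake_name(name: str) -> bool:
    return _NAME_RE.fullmatch(name) is not None
```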
NAS is a storage system that is mounted directly into the Dremio cluster (via NFS or SMB). It is only available for on-premises installations. The following settings need to be defined for this data lake:
Settings | Description |
---|---|
Mount path | The mount path of the NAS |
For Azure Data Lake Storage Gen1, Azure Storage, and Amazon S3 data lakes, you can also define additional connection properties. Click Add property and define an additional property consisting of a key/value pair.
Once all settings are defined, click Save in the action bar to the right. During the save process, the following steps are automatically conducted:
Editing the settings is not supported. To change them, delete the old settings and define new ones.
Click Delete in the action bar to delete the settings. During deletion, all Dremio artifacts which were created when saving the settings are deleted. All offloading pipelines and their histories are deleted; active pipelines are deleted after completing the current offloading. The data lake and its contents are not deleted.