Setting up DataHub

This section describes how to set up Cumulocity IoT DataHub.

Prerequisites

Before setting up DataHub, the following prerequisites need to be checked:

You need the connection settings and credentials for a cloud data lake service. During offloading, the data will be written into a data lake folder named after the tenant name.

Info: Instructions on how to configure the data lake so that it is accessible via Dremio are available in the Dremio data source documentation. Note that you must not create the data lake source artefact in Dremio yourself; this is done by DataHub.

The DataHub microservice and web application must be available as applications on your tenant. The web application provides the user interface to configure DataHub and to manage offloading pipelines, while the microservice provides the actual backend implementation for this functionality. The web application and the microservice are named DataHub and Datahub respectively. Both applications are deployed either as:

If you have an enterprise tenant, you can also subscribe your sub-tenants to both applications so that the sub-tenants can use DataHub as well.

See section Managing applications for details on managing applications in general, including instructions for:

Defining DataHub permissions and roles

Dedicated permissions define what a user is allowed to do in DataHub. To ease assigning permissions to users, permissions are grouped into roles. During deployment of the DataHub applications, the corresponding permissions and roles are created. If a role with the same name already exists, no new role is created; the same holds for permissions.

If you do not have corresponding DataHub permissions, you will get a warning after login.

Info: When offloading the inventory/events/alarms/measurements collections, DataHub does not incorporate access limitations for these collections as set in the Cumulocity IoT platform. In particular, inventory roles defining permissions on device groups are not incorporated in the offloading process. As a consequence, a user with DataHub permissions can access all data in the data lake, irrespective of the access restrictions the user has on the base collections.

DataHub roles and permissions

DataHub administrator

The administrator primarily sets up the data lake and Dremio account and conducts administrative tasks like viewing audit logs or monitoring the system status. The administrator can also manage offloading pipelines, e.g., defining and starting a pipeline.

For those tasks the default role DATAHUB_ADMINISTRATOR is created. The permissions for this role are defined as follows:

| Type | READ | ADMIN |
| ---- | ---- | ----- |
| Datahub administration | yes | yes |
| Datahub management | yes | yes |
| Datahub query | yes | no |

DataHub manager

The manager manages offloading pipelines, e.g., defining and starting a pipeline. For those tasks the default role DATAHUB_MANAGER is created. The permissions for this role are defined as follows:

| Type | READ | ADMIN |
| ---- | ---- | ----- |
| Datahub administration | no | no |
| Datahub management | yes | yes |
| Datahub query | yes | no |

DataHub user

The user runs queries against the data in the data lake. For details see section Querying offloaded Cumulocity IoT data. To run queries, the following approaches can be used (a query sketch follows the permissions table below):

The permissions for the role DATAHUB_READER are defined as follows:

| Type | READ | ADMIN |
| ---- | ---- | ----- |
| Datahub administration | no | no |
| Datahub management | no | no |
| Datahub query | yes | no |
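For illustration, the following Python sketch shows one way a user with the DATAHUB_READER role could run a query, here through Dremio's public REST API. The host, credentials, and table path are placeholders; depending on your installation you may instead use one of the other interfaces described in Querying offloaded Cumulocity IoT data.

```python
import time
import requests

DREMIO_URL = "https://dremio.example.com"  # placeholder Dremio host

# Log in with a Dremio account (placeholder credentials) and build the auth header.
token = requests.post(
    f"{DREMIO_URL}/apiv2/login",
    json={"userName": "t12345/user", "password": "secret12"},
).json()["token"]
headers = {"Authorization": f"_dremio{token}"}

# Submit a SQL statement; the table path is a placeholder for an offloaded target table.
job_id = requests.post(
    f"{DREMIO_URL}/api/v3/sql",
    headers=headers,
    json={"sql": "SELECT * FROM myTargetTable LIMIT 10"},
).json()["id"]

# Poll the job until it reaches a terminal state, then fetch the results.
while True:
    state = requests.get(f"{DREMIO_URL}/api/v3/job/{job_id}", headers=headers).json()["jobState"]
    if state in ("COMPLETED", "FAILED", "CANCELED"):
        break
    time.sleep(1)

results = requests.get(f"{DREMIO_URL}/api/v3/job/{job_id}/results", headers=headers).json()
print(results["rows"])
```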

Assignment of DataHub roles and permissions

The roles DATAHUB_ADMINISTRATOR, DATAHUB_MANAGER, and DATAHUB_READER have to be assigned to the respective users of your tenant. For assigning roles to users see section Managing permissions. You need at least one user with the DATAHUB_ADMINISTRATOR role to complete the DataHub configuration.

Info: You do not necessarily need to use the predefined roles to enable Cumulocity IoT users to work with DataHub. Alternatively, you can modify other roles the users are associated with and add the corresponding permissions to those roles. In that case you also have to add the DataHub application to the user’s applications.
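As an illustration of the assignment step, the following Python sketch adds a user to one of the predefined global roles via the Cumulocity IoT user API. The tenant URL, credentials, user name, and the exact role name are placeholders to be verified against your tenant.

```python
import requests

BASE_URL = "https://mytenant.cumulocity.com"  # placeholder tenant URL
AUTH = ("t12345/admin", "admin-password")     # placeholder admin credentials
TENANT = "t12345"                             # placeholder tenant ID

# Look up the global role (user group) by name; the exact name is an assumption,
# check the roles available on your tenant.
group = requests.get(
    f"{BASE_URL}/user/{TENANT}/groupByName/DATAHUB_ADMINISTRATOR", auth=AUTH
).json()

# Add the user to that group, which assigns the role and its permissions.
requests.post(
    f"{BASE_URL}/user/{TENANT}/groups/{group['id']}/users",
    auth=AUTH,
    headers={"Content-Type": "application/vnd.com.nsn.cumulocity.userreference+json"},
    json={"user": {"self": f"{BASE_URL}/user/{TENANT}/users/jane.doe"}},  # placeholder user
)
```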

Setting up Dremio account and data lake

Setting up DataHub requires you to choose a Dremio account name and to provide credentials for the data lake. In the navigator, select Settings to define these settings.

Info: You need administration permissions to define the settings. See the section on Defining DataHub permissions and roles for details.

Settings whose meaning may not be obvious are equipped with a help icon. Click the icon for more information.

Defining new settings

Dremio Account

Under Dremio Account, the name and password of the Dremio account are defined.

The name is composed of three parts:

  1. tenant id
  2. forward slash
  3. a string with a minimum length of two, starting with a letter and consisting of letters, numbers, dashes, or underscores

If your tenant ID is t12345, then t12345/user is a valid name; the system also uses this value as the initial value in the account field.

The password of the Dremio account must be at least eight characters long and include at least one letter and one number.
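The following Python sketch expresses these rules as regular expressions; the patterns are an interpretation of the rules above, not an official validation routine.

```python
import re

# Account name: <tenant id>/<suffix>, where the suffix is at least two characters
# long, starts with a letter, and contains only letters, numbers, dashes, or
# underscores (interpretation of the rule above).
ACCOUNT_NAME = re.compile(r"^(?P<tenant>[^/]+)/(?P<suffix>[A-Za-z][A-Za-z0-9_-]+)$")

# Password: at least eight characters, including at least one letter and one number.
PASSWORD = re.compile(r"^(?=.*[A-Za-z])(?=.*\d).{8,}$")

def is_valid_account_name(name: str, tenant_id: str) -> bool:
    match = ACCOUNT_NAME.match(name)
    return bool(match) and match.group("tenant") == tenant_id

def is_valid_password(password: str) -> bool:
    return bool(PASSWORD.match(password))

print(is_valid_account_name("t12345/user", "t12345"))  # True
print(is_valid_password("secret12"))                   # True
```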

Data Lake

Depending on the configuration of the environment, the data lake provider is either fixed or you can choose among different providers. For each data lake provider, you have to specify the corresponding settings that define the data lake to be used. Once the configuration of the data lake is completed, it cannot be changed.

The following types of data lakes are currently supported:

Azure Storage is a set of cloud storage services offered by Microsoft. DataHub supports Azure Data Lake Storage Gen2, which is part of these services. The following settings need to be defined for this data lake:

| Settings | Description |
| -------- | ----------- |
| Azure Storage account name | The name of the Azure storage account |
| Azure Storage container | The name of the storage container; it must be between 1 and 63 characters long and may contain alphanumeric characters (letters and numbers) as well as dashes (-) |
| Root path | The root path in the data lake under which the offloaded data will be stored; default root path is /; setting a sub-folder allows you to hide other data in the container from DataHub |
| Azure Storage shared access key | The access key used for authentication |
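Before saving these values, you may want to check that the account name, container, and shared access key actually grant access. A minimal sketch using the azure-storage-file-datalake Python package, with all values as placeholders:

```python
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_NAME = "mystorageaccount"   # placeholder Azure Storage account name
CONTAINER = "datahub-container"     # placeholder Azure Storage container
SHARED_KEY = "<shared access key>"  # placeholder shared access key

# Connect to the ADLS Gen2 endpoint with the shared key and list the container root;
# this fails if the credentials or the container name are wrong.
service = DataLakeServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.dfs.core.windows.net",
    credential=SHARED_KEY,
)
file_system = service.get_file_system_client(file_system=CONTAINER)
for path in file_system.get_paths():
    print(path.name)
```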

Amazon S3 is an object storage service offered by Amazon Web Services. The following settings need to be defined for this data lake:

| Settings | Description |
| -------- | ----------- |
| AWS access key | The access key |
| Access secret | The access secret |
| Bucket name | The name of the S3 bucket; it must be between 1 and 63 characters long and may contain alphanumeric characters (letters and numbers) as well as dashes (-) |
| Root path in bucket | The root path within the S3 bucket; default root path is /; setting a sub-folder allows you to hide other data in the bucket from DataHub |
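Similarly, a short boto3 sketch can confirm that the access key and secret reach the bucket before you save them; the bucket name and root path are placeholders.

```python
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",              # placeholder AWS access key
    aws_secret_access_key="<access secret>",  # placeholder access secret
)

# List at most one object under the intended root path; an error here usually points
# to wrong credentials, a wrong bucket name, or insufficient permissions.
response = s3.list_objects_v2(Bucket="my-datahub-bucket", Prefix="", MaxKeys=1)
print(response.get("KeyCount", 0))
```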

NAS is a storage system mounted directly (via NFS or SMB) into the Dremio cluster. It is only available for Edge installations. The following settings need to be defined for this data lake:

| Settings | Description |
| -------- | ----------- |
| Mount path | The mount path refers to a path in the local Linux file system on both the coordinator and executor containers. By default, the file system of Cumulocity IoT Edge is mounted into /datalake inside the containers. To use some other folder, you must map the folder into both containers, e.g. to /datalake inside the containers. |
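If in doubt whether the mapping is in place, a small Python check run inside each container can confirm that the mount path exists and is writable; /datalake is the default path described above.

```python
import os

MOUNT_PATH = "/datalake"  # default mount path inside the Dremio containers

# The path must exist and be readable and writable for offloading to work.
print("exists:", os.path.isdir(MOUNT_PATH))
print("read/write:", os.access(MOUNT_PATH, os.R_OK | os.W_OK))
```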

HDFS is the Hadoop Distributed File System, which is a distributed, scalable file system designed for running on commodity hardware. The following settings need to be defined for this data lake:

| Settings | Description |
| -------- | ----------- |
| Namenode host | The host name of the HDFS NameNode |
| Namenode port | The port of the HDFS NameNode |
| Root path | The root path within the HDFS filesystem for storing offloaded data; default root path is /; setting a sub-folder allows you to hide other data in the filesystem from DataHub |
| Short-circuit local reads | If enabled, Dremio can directly open the HDFS block files; default is disabled |
| Enable impersonation | If disabled, all requests against HDFS will be made using the user dremio; if enabled, the name of the user logged into Dremio will be used to access HDFS; prerequisite is that the user has rwx-permissions for the given root path |
| Allow VDS-based access delegation | If enabled, data used in virtual datasets (VDS) will be requested from HDFS using the username of the owner of the VDS; if disabled, the name of the user logged into Dremio is used |
| Impersonation user delegation | Defines whether an impersonated username is either As is, Lowercase, or Uppercase |

Info: Impersonation is supported and used. However, Dremio uses the tenant ID as the user name for querying HDFS, not the actual user name. For example, if “t12345/user” is the logged-in user, Dremio will use “t12345” for HDFS requests. Thus, granting file system permissions is only possible on a per-tenant basis and not on a per-user basis.
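To check the NameNode host and port before saving the settings, a connectivity sketch along the following lines may help. It uses pyarrow's HDFS binding, which requires the Hadoop client libraries and libhdfs on the machine running the check; host, port, and root path are placeholders.

```python
from pyarrow import fs

# Connect to the HDFS NameNode (placeholder host and port) and list the root path
# intended for offloaded data.
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)
for info in hdfs.get_file_info(fs.FileSelector("/", recursive=False)):
    print(info.path, info.type)
```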

For Azure Storage, Amazon S3, and HDFS data lakes, you can also define additional connection properties. Click Add property and define an additional property consisting of a key/value pair.

Saving settings

Once all settings are defined, click Save in the action bar to the right. During the save process, the following steps are automatically conducted:

Editing settings

Editing the settings is not supported. You have to delete the existing settings and define new settings.

Deleting settings

Click Delete in the action bar to delete the settings. During deletion, all Dremio artifacts which were created when saving the settings are deleted. All offloading pipelines and their histories are deleted; active pipelines are deleted after completing the current offloading. The data lake and its contents are not deleted.