Refining offloaded Cumulocity data

In addition to SQL querying using standard interfaces, you can utilize Dremio functionality to further refine and curate your (device) data.

For a detailed description of all functionality Dremio provides, consult the Dremio documentation.

Accessing and logging into Dremio

You access Dremio via a web browser.

Info: Support for mobile devices like smartphones or tablets has not been tested.

To access Dremio, navigate to the home page. Under Quick links, click the Dremio icon. This directs you to the Dremio login screen.

Info: Your Dremio user does not have administration rights in Dremio.

How to log into Dremio

On the Login screen, enter the same credentials as for your DataHub Console. Click Login to enter Dremio.

When you log in successfully, you will be taken to the home page of Dremio.

To log out, click your user name at the right of the top bar and select Log out.

Sources and spaces

On the Dremio home page, you will find two panels on the left under Datasets: Spaces and Sources.

Sources

The Sources panel contains the data source <YourAccountName>DataLake. This source has been auto-configured for you and points to your data lake.

Info: In terms of terminology, Cumulocity IoT DataHub replicates data from the Cumulocity Operational Store into the data lake. For Dremio, the data lake with its target tables is a data source, because Dremio reads data from it.

When you click your data source, it is shown in the main panel. Clicking the source in the main panel navigates into the data source, where you see a list of all target tables of your offloading pipelines. Clicking one of these target tables opens an SQL editor in which you can run queries against that target table.
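For example, a simple query against such a target table might look as follows; the source, folder, and table names are illustrative and depend on your account and offloading configuration:

-- illustrative names; replace source, folder, and table with your own
SELECT *
FROM myAccountDataLake.myAccount.alarms
ORDER BY "time" DESC
LIMIT 100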

Info: You will also see a folder named c8y_cdh_temp. The folder is used for DataHub-internal purposes and must not be deleted or altered.

Spaces

A space in Dremio helps you organize your datasets. Cumulocity IoT DataHub auto-configures a space named YourAccountNameSpace. A dataset in the space is referred to in queries as YourAccountNameSpace.YourDataset. As described in section Offloading of inventory collection, offloading the inventory collection already ships two views, "inventory_all" and "inventory_latest", which are available in your space.
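For example, the inventory_latest view can be queried via your space; the space name below is a placeholder for your actual space name:

-- placeholder space name; replace with your actual space
SELECT *
FROM YourAccountNameSpace.inventory_latest
LIMIT 100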

Job history

The Job History tab at the top of the screen displays the jobs/queries you have executed. Clicking a particular job shows its details, and the tab offers filter capabilities (time range, job status, query type, and queue). The Profile view inside the job details is very useful for investigating optimization potential in your queries.

Info: The job history only contains queries that you actively run; the jobs related to the data extraction are hidden.

Creating views

With Cumulocity IoT DataHub, you can replicate data from a Cumulocity collection to a data lake using a default transformation of the data. As requirements for subsequent data analysis of the offloaded device data may vary over time, you should configure your offloading pipeline so that all potentially relevant data is included.

Depending on your use cases, you will often need to provide a view on the data that limits, filters, or transforms the data, e.g., converting Celsius to Fahrenheit or extracting data from JSON fragments.
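A query implementing such a transformation might look like the following sketch, which converts a temperature column from Celsius to Fahrenheit; the table and column names are illustrative and depend on your offloading configuration:

-- illustrative table and column names
SELECT "time", ("temperatureCelsius" * 9.0 / 5.0) + 32 AS "temperatureFahrenheit"
FROM myAccountDataLake.myAccount.measurements

You can save such a query as a view as described below.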

In Dremio, you can create such a view by defining a corresponding query and saving it as a new dataset. When saving that new dataset, you must select your space as the location and can freely choose a name for the view. Once that is done, you can work with the new dataset like any other dataset and run queries against it. This includes in particular querying the view from other clients as described in section Querying Offloaded Cumulocity Data.

Info: Such a view is by default not materialized, i.e., it is not stored persistently. Each time you query the view, the underlying query defining the view is run against the source data.

Example

Consider the case that you want to visualize data in a reporting tool. The raw data volume is too high, so instead you want to show the hourly average of the column 'myValue'. You can easily do that by creating a view with the following SQL statement and saving it as a view/virtual dataset:

SELECT DATE_TRUNC('HOUR', "time") AS "time", AVG("myValue") AS "avg"
FROM myTable
GROUP BY DATE_TRUNC('HOUR', "time")
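Once you have saved this query as a view in your space (here under the hypothetical name myHourlyAverages), you can query it like any other dataset:

-- hypothetical view name in your space
SELECT *
FROM YourAccountNameSpace.myHourlyAverages
ORDER BY "time"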

Joining tables/views

Views you have defined and target tables from your data lake can also be joined. In Dremio you can define a join either in the SQL editor or via the graphical interface.

A common use case for joining is to enrich your alarms, events, or measurement values with metadata from the inventory collection, for example:

SELECT *
FROM myAccountDataLake.myAccount.alarms
JOIN myAccountDataLake.myAccount.inventory
USING(id)