We are trying to use Apache Cassandra in an IoT-based application. We are planning to create a device abstraction. Any user should be able to define a device with a series of attributes. For each attribute, the user should be able to define a series of properties such as name, data type, minimum value, maximum value, etc.
Some examples of devices are given below.
Vehicle
The vehicle can have the following attributes:
- Speed [name: Speed, data type: double, minimum value: 0, maximum value: 300]
- Latitude [name: Latitude, data type: double, minimum value: -90, maximum value: 90]
- Longitude [name: Longitude, data type: double, minimum value: -180, maximum value: 180]
Temperature Sensor
The temperature sensor can have the following attributes:
- Current Temperature [name: Current Temperature, data type: double, minimum value: 0, maximum value: 300]
- Unit [name: Unit, data type: string]
In real time, each device will send data as key-value pairs.
For example, a vehicle can send the following data:
Time: 6/4/2016 11:15:15.150, Latitude: -1.256, Longitude: -180.75, Speed: 50
Time: 6/4/2016 11:15:16.150, Latitude: -1.257, Longitude: -181.75, Speed: 51
For example, a temperature sensor can send the following data:
Time: 6/4/2016 11:15:15.150, Current Temperature: 100, Unit: Fahrenheit
Time: 6/4/2016 11:15:16.150, Current Temperature: 101, Unit: Fahrenheit
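To make the device abstraction concrete, here is a minimal Python sketch of user-defined attributes and a check of incoming key-value pairs against them. The class names AttributeDef and DeviceType and the "double"/"string" type labels are assumptions made for illustration; they are not part of Cassandra or of any existing library.

```python
from dataclasses import dataclass
from typing import Optional, Union

Value = Union[float, str]


@dataclass(frozen=True)
class AttributeDef:
    """User-defined description of one device attribute (hypothetical)."""
    name: str
    data_type: str                  # "double" or "string" in this sketch
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def validate(self, value: Value) -> list:
        """Return a list of problems with `value`; an empty list means valid."""
        if self.data_type == "string":
            return [] if isinstance(value, str) else [f"{self.name}: expected a string"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return [f"{self.name}: expected a number"]
        problems = []
        if self.minimum is not None and value < self.minimum:
            problems.append(f"{self.name}: {value} is below minimum {self.minimum}")
        if self.maximum is not None and value > self.maximum:
            problems.append(f"{self.name}: {value} is above maximum {self.maximum}")
        return problems


@dataclass(frozen=True)
class DeviceType:
    """A named device with a fixed set of attribute definitions (hypothetical)."""
    name: str
    attributes: tuple

    def validate(self, reading: dict) -> list:
        """Check every key-value pair of a reading against the definitions."""
        defs = {a.name: a for a in self.attributes}
        problems = []
        for key, value in reading.items():
            if key not in defs:
                problems.append(f"{key}: not defined for {self.name}")
            else:
                problems.extend(defs[key].validate(value))
        return problems


VEHICLE = DeviceType("Vehicle", (
    AttributeDef("Speed", "double", 0, 300),
    AttributeDef("Latitude", "double", -90, 90),
    AttributeDef("Longitude", "double", -180, 180),
))
```

Calling VEHICLE.validate on the first vehicle sample above reports that Longitude -180.75 is below the declared minimum of -180, which is the kind of check the application could run before anything is written to Cassandra.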
Since the attributes can differ from device to device, we are unsure how to model the tables in Cassandra. Some of the options that came to mind are creating a table per device, or creating a single table and storing the values in map data types. We are not sure which approach to take. Any suggestions are appreciated.
3 Answers
Answer 1
Definitely don't create a table per device. I imagine you would end up with hundreds or thousands of tables and minimal control over how they are modelled. Cassandra doesn't deal very well with this because it requires memory for each table, which reduces the memory available to the key cache and row cache (if you use it).
The map method may be feasible however there are some things to consider before going down that path:
Will a device entry receive frequent updates, and how will you update it? If you plan to update every element in the map, you should update each element individually rather than overwriting the whole map, because every overwrite of a collection in Cassandra creates a range tombstone. If you overwrite frequently, you will end up with millions of tombstones, which will probably not get compacted away as efficiently as you'd like. This can be avoided by storing the values as JSON instead (for example, in a text column) and processing it in your application.
You also need to consider how the data will be queried. If you want users to be able to query on the data in the map, it could get a little more complicated. I think you'd be better off having a single method of querying regardless of device type and then extracting the details in your application. However, this is up to you, and it is pretty much the driving force behind how you structure your data. The best advice I can give is to steer clear of creating too many tables, and to be wary of giving your users a lot of control over the data structure, as it's very easy to do poorly and cause performance issues on the cluster.
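The two points above (update map elements individually, or keep JSON and extract details in the application) can be sketched in Python as follows. The device_state table, its columns, and the helper names are hypothetical and exist only for illustration; the statements are plain strings meant to be bound and executed by whichever driver you use.

```python
import json

# Hypothetical table, assumed for illustration only:
#   CREATE TABLE device_state (
#       device uuid PRIMARY KEY,
#       attrs map<text, double>,   -- holds numeric attributes only
#       payload text               -- holds a whole reading as JSON
#   );

# Replacing the whole (non-frozen) map writes a range tombstone over the
# old contents before the new entries are written.
OVERWRITE_WHOLE_MAP = "UPDATE device_state SET attrs = %s WHERE device = %s"

# Setting one element at a time does not create that range tombstone.
UPDATE_ONE_ELEMENT = "UPDATE device_state SET attrs[%s] = %s WHERE device = %s"

# Alternative: keep the reading as JSON in a plain text column.
UPDATE_JSON_PAYLOAD = "UPDATE device_state SET payload = %s WHERE device = %s"


def per_element_updates(device_id, reading):
    """Build one (statement, parameters) pair per map entry."""
    return [(UPDATE_ONE_ELEMENT, (key, value, device_id))
            for key, value in sorted(reading.items())]


def encode_payload(reading):
    """Serialize a reading to JSON text for storage in a text column."""
    return json.dumps(reading, sort_keys=True)


def decode_payload(payload):
    """Parse the stored JSON text back into a dict in the application."""
    return json.loads(payload)
```

With the JSON variant, every device type is read back with the same query, and decode_payload recovers the device-specific keys on the application side. The trade-off is that Cassandra cannot filter on anything inside the payload.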
If you haven't already, give this blog post a read. It points out the basic elements of data model design that you need to get right when using Cassandra: http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
Answer 2
I think the best option is to create only one table with a general-purpose schema for collecting time-series data.
Example CQL:
CREATE TABLE timeline (
    device uuid,
    time timeuuid,
    key text,
    value blob,
    …
    PRIMARY KEY ((device, key), time)
);
Values can be stored as a blob (custom serialization), a map, or numeric scalars, depending on your application's use case and data access patterns (how you read, write, and delete data, and whether you plan to support updates).
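As a rough illustration of the blob option, the Python sketch below turns one device reading into one row per attribute for the timeline table above. The one-byte type tag format and the helper names are assumptions chosen for this example, not an established serialization format, and the INSERT statement is only a string to hand to your driver.

```python
import struct
import uuid

# Statement for the timeline table; parameters are bound by the driver.
INSERT_POINT = "INSERT INTO timeline (device, time, key, value) VALUES (%s, %s, %s, %s)"


def encode_value(value):
    """Custom serialization: a 1-byte type tag followed by the payload."""
    if isinstance(value, str):
        return b"s" + value.encode("utf-8")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"unsupported value type: {type(value).__name__}")
    # Numbers are stored as 8-byte big-endian doubles.
    return b"d" + struct.pack(">d", float(value))


def decode_value(blob):
    """Reverse encode_value using the leading type tag."""
    tag, body = blob[:1], blob[1:]
    if tag == b"s":
        return body.decode("utf-8")
    if tag == b"d":
        return struct.unpack(">d", body)[0]
    raise ValueError(f"unknown type tag: {tag!r}")


def rows_for_reading(device_id, time_id, reading):
    """One row per attribute, in the column order used by INSERT_POINT."""
    return [(device_id, time_id, key, encode_value(value))
            for key, value in reading.items()]


# Example: a vehicle reading becomes one row per attribute.
device_id = uuid.uuid4()   # hypothetical device identifier
time_id = uuid.uuid1()     # version 1 (time-based) UUID, built from the current clock
rows = rows_for_reading(device_id, time_id, {"Latitude": -1.257, "Speed": 51})
```

Because the partition key is (device, key), each attribute of each device forms its own time-ordered partition, so reading the speed history of one vehicle touches a single partition.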
FYI useful related Datastax posts about time-series modeling:
Answer 3
Have you looked at using the different Collection Data types in Cassandra to store the information that differs between devices?
https://docs.datastax.com/en/cql/3.0/cql/cql_using/use_collections_c.html