Creator: by Florian_Dietz
The comment of this Tag describes the datatype of a column.
Valid values are:
-primitive values like int32, int64, float32, float64, boolean, string, datetime, ...
-special values: daily_date and daily_delta. These are more specific variants of datetime and timedelta, respectively, in which all values are whole days (the hours, minutes, etc. are all 0).
-categorical: Different values represent different categories.
-numeric: Values can be ordered and basic mathematical functions can be applied to them.
-discrete: Values are numeric and there is a finite number of values between any two values.
-continuous: Values are numeric and there can be an infinite number of values between two values.
-candidate_key: A different value for every entry and no null entries. Note that info_primary_key is used to mark the primary key separately.
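Two of the definitions above can be checked mechanically. The following Python sketch is only an illustration: the function names is_candidate_key and is_daily_date are hypothetical and not part of any existing Program, and None is assumed to stand in for a null entry.

```python
from datetime import datetime


def is_candidate_key(values):
    """True if every entry is non-null and no value occurs twice.

    Assumes the values are hashable and that None represents a null entry.
    """
    if any(v is None for v in values):
        return False
    return len(set(values)) == len(values)


def is_daily_date(values):
    """True if every datetime falls exactly on midnight (no time component)."""
    return all(
        (v.hour, v.minute, v.second, v.microsecond) == (0, 0, 0, 0)
        for v in values
    )


item_ids = [101, 102, 103]
due_dates = [datetime(2019, 1, 11), datetime(2019, 1, 12)]
print(is_candidate_key(item_ids), is_daily_date(due_dates))
```

A column that passes is_daily_date could be marked as daily_date in addition to datetime. One that fails should keep only the more general datetime value.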
The purpose of this Tag is to inform other Rules and Programs so that they can make intelligent decisions. Therefore, being useful is more important than being correct. An integer indicator variable that only has the values 0 and 1, for example, is technically numeric and discrete, but it may make more sense to mark it as categorical instead. The same is true for IDs and for other kinds of numeric values that fall into a suspiciously small number of clusters of different values.
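One possible way to turn this guidance into a heuristic is sketched below in Python. It is an assumption, not an established rule: the function name suggest_kind_tags and the threshold max_categories=10 are hypothetical, and a real Program would likely need more context (for example, to recognize IDs, which this sketch does not attempt).

```python
def suggest_kind_tags(values, max_categories=10):
    """Suggest 'categorical' and/or 'numeric' for a list of numbers.

    max_categories is an arbitrary, hypothetical threshold for what counts
    as a "suspiciously small" number of distinct values.
    """
    distinct = set(values)
    tags = []
    if len(distinct) <= max_categories:
        tags.append("categorical")
    # A pure 0/1 indicator is treated as a category only, not as a number.
    if not distinct <= {0, 1}:
        tags.append("numeric")
    return tags


indicator = [0, 1, 1, 0, 1]
discounts = [0.0, 0.1, 0.25, 0.1, 0.0]
print(suggest_kind_tags(indicator))
print(suggest_kind_tags(discounts))
```

Under these assumptions the indicator column is suggested as categorical only, while the discount column is suggested as both categorical and numeric.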
If multiple datatypes apply, separate them with commas. Add both leading and trailing commas to make search easier and safer by avoiding accidental prefix/postfix matches. Examples:
-The name of the customer: ",string,categorical,"
-The ID of the item the customer bought: ",candidate_key,int64,int,categorical,"
-The date and time on which an item was bought: ",datetime,continuous,numeric,"
-The date on which the item is due (without a time component): ",daily_date,datetime,discrete,numeric,"
-The number of items the customer bought: ",int64,int,discrete,numeric,"
-The price the customer paid: ",float64,float,continuous,numeric,"
-The discount of the item: ",float64,float,continuous,numeric,categorical,"
Note the last example: If the discount is a number, but there are only a fixed number of different discount values, it can be useful both as a numeric value and as a categorical value, depending on the algorithm you want to use.
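The comma convention can be illustrated with a short Python sketch. The helper names format_datatype_comment and has_datatype are hypothetical. The sketch only shows why searching for a datatype together with both surrounding commas avoids accidental prefix matches such as "int" inside "int64".

```python
def format_datatype_comment(datatypes):
    """Join datatype names with commas, adding a leading and trailing comma."""
    return "," + ",".join(datatypes) + ","


def has_datatype(comment, datatype):
    """Look for an exact datatype by searching for it with both commas."""
    return "," + datatype + "," in comment


comment = format_datatype_comment(["int64", "int", "discrete", "numeric"])
print(comment)

# A naive substring search would wrongly find "int" inside "int64";
# the comma-delimited search does not.
only_int64 = ",int64,discrete,numeric,"
print("int" in only_int64, has_datatype(only_int64, "int"))
```

The same search works for the first and last entries of the comment, because every entry is surrounded by commas on both sides.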
Note that the datatype of the column should be flexible enough to allow all ways of treating the column specified by this Tag. For example, if we are dealing with a Pandas DataFrame and a column is marked as both numeric and categorical, then the column in the Pandas DataFrame should not be of type 'category' as that would make numeric operations impossible.
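A minimal sketch of this situation, assuming pandas is installed and using a made-up column name discount: the column is stored as float64 so that numeric operations keep working, and a categorical view is derived only when an algorithm needs it.

```python
import pandas as pd

# Hypothetical column tagged ",float64,float,continuous,numeric,categorical,"
df = pd.DataFrame({"discount": [0.0, 0.1, 0.25, 0.1, 0.0]})

# Stored as float64, so numeric operations are available.
mean_discount = df["discount"].mean()

# A categorical view is created on demand and does not replace the column.
discount_as_category = df["discount"].astype("category")
n_levels = len(discount_as_category.cat.categories)

print(df["discount"].dtype, discount_as_category.dtype, n_levels)
```

Storing the column the other way around, as 'category', would require converting it back before every numeric operation.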
The comment can be set to "TBD" (for to-be-determined) to indicate that this value needs to be redetermined. This is useful if a program alters the column but does not know how to describe the datatype. It leaves making that description up to other programs.
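As a rough sketch of how a program might use this value, the hypothetical Python helpers below write "TBD" when the new datatypes are unknown and let a later program detect that the description still has to be worked out. Neither function name comes from an existing API.

```python
def comment_after_alteration(known_datatypes=None):
    """Return the comment to write after a program has altered a column.

    If the program cannot describe the new datatype, it writes "TBD".
    """
    if known_datatypes is None:
        return "TBD"
    return "," + ",".join(known_datatypes) + ","


def needs_redetermination(comment):
    """True if another program still has to determine the datatype."""
    return comment == "TBD"


unknown = comment_after_alteration()
known = comment_after_alteration(["float64", "float", "continuous", "numeric"])
print(unknown, needs_redetermination(unknown))
print(known, needs_redetermination(known))
```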
Arguments of this Tag:
0 : A column Tag.
Created: Jan. 11, 2019, 4:25 p.m.