DataGen Object
fernando edited this page Feb 22, 2015 · 2 revisions
The DataGen object is used during both the training and prediction stages of the classifier. It defines basic characteristics of the dataset and how it should be handled. In particular, it receives the name of the target field, the names of features kept for identification purposes but not used by the model (such as user ids), and the names of the categorical and numerical features, so that it can extract the relevant data fields properly. It can also apply an arbitrary (online) transformation to a given feature.
DataGen(self, max_features, target, descriptive=(), categorical=(), numerical=None, transformation=None)
Parameters:
- max_features: The maximum total number of features. Categorical features are hashed modulo (max_features - number of numerical and other features).
- target: The name of the target variable in the csv file.
- descriptive: A tuple with the names of all features that will not be used in the model (such as user ids).
- categorical: The names of all categorical features to be hashed.
- numerical: If present, the names of the numerical features.
- transformation: A map (Python dict) from the name of the new variable to a tuple containing the name of the original variable to be transformed and the function to apply to it.
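The hashing rule for max_features and the shape of the transformation map can be illustrated with a small standalone sketch. It does not call the library: `hashed_index` and `apply_transformations` are hypothetical helpers, the CRC32 hash and the choice to place hashed columns after the reserved ones are assumptions, and the library's actual hash function and column layout may differ.

```python
import math
import zlib


def hashed_index(name, value, max_features, n_reserved):
    """Map a categorical (name, value) pair to a column index.

    Assumption: the first n_reserved columns hold numerical and other
    features, and hashed categorical features share the remaining slots.
    """
    n_slots = max_features - n_reserved
    if n_slots <= 0:
        raise ValueError("max_features must exceed the number of reserved features")
    key = "{}={}".format(name, value).encode("utf-8")
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return n_reserved + zlib.crc32(key) % n_slots


def apply_transformations(row, transformation):
    """Add one new field per entry of a {new_name: (source_name, func)} map."""
    out = dict(row)  # copy so the original row is left untouched
    for new_name, (source_name, func) in transformation.items():
        out[new_name] = func(row[source_name])
    return out


# A made-up csv row, as a dict of column name -> raw string value.
row = {"user_id": "u42", "country": "BR", "age": "31", "clicked": "1"}

# Same shape as the transformation parameter: new name -> (original name, function).
transformation = {"log_age": ("age", lambda v: math.log(float(v)))}

transformed = apply_transformations(row, transformation)
idx = hashed_index("country", transformed["country"], max_features=2 ** 20, n_reserved=2)
```

Here two columns are treated as reserved (for example `age` and `log_age`), so `idx` always falls in the remaining `max_features - 2` slots.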
Example: