standardize function behavior suggestion (#187) · Issues · public_projects / ketos

standardize function behavior suggestion

I would like to suggest some changes to the standardize function.

The first change is pretty straightforward: remove the mapper argument in favour of simply using the pandas .rename function. The user would therefore have to prepare the column names before calling the standardize function. I favour not having these wrappers around simple operations that already exist in other packages. I think it simplifies our functions and makes them easier overall to include in the user's code.
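As a rough illustration of what the user would do instead of passing a mapper, here is a minimal pandas-only sketch. The input column names (fname, begin, stop, species) are made up for the example, and the target names (filename, start, end, label) are assumed to be the ones standardize expects.

```python
import pandas as pd

# Hypothetical annotation table with non-standard column names.
annot = pd.DataFrame({
    "fname": ["a.wav", "b.wav"],
    "begin": [0.5, 3.0],
    "stop": [1.5, 4.2],
    "species": ["whale", "boat"],
})

# Rename the columns up front with plain pandas, instead of relying
# on a mapper argument. The target names are assumptions.
annot = annot.rename(columns={
    "fname": "filename",
    "begin": "start",
    "stop": "end",
    "species": "label",
})
```

The renamed table could then be passed to standardize without any mapper.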

The second change would be bigger. It would be going back to the label and start_label_at_1 arguments. The problem with the current implementation is that it limits the user to generating labels starting from 0 or starting from 1. While this seems ideal as the neural networks need incremental labels starting from 0, it creates potential problems in the data processing pipeline.

The problem I often run into is adding more data to an existing database. For example, let's assume I already have a database with 3 labels: 0, 1 and 2. Now let's say I want to add more data to my database with only label '2'. This is not possible with the current implementation of the standardize function; I have to do it manually. The reason is that standardize will map the label '2' to '0' (or to '1' if I use start_label_at_1), because the annotations for the new data contain only one label. I can provide an example of this if it is not clear.
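To make the problem concrete, here is a small sketch. The function relabel_from_zero is a hypothetical stand-in that mimics the behaviour described above (map whatever labels are present to consecutive integers); it is not the actual ketos code.

```python
import pandas as pd

def relabel_from_zero(df, start_at_1=False):
    """Stand-in for the described behaviour (not the ketos implementation):
    map the labels present in df to consecutive integers."""
    unique = sorted(df["label"].unique().tolist())
    offset = 1 if start_at_1 else 0
    mapping = {old: i + offset for i, old in enumerate(unique)}
    out = df.copy()
    out["label"] = out["label"].map(mapping)
    return out, mapping

# Existing database: three labels.
existing = pd.DataFrame({"filename": ["a.wav", "b.wav", "c.wav"],
                         "label": [0, 1, 2]})
# New data to be added: only label 2 is present.
new = pd.DataFrame({"filename": ["d.wav", "e.wav"],
                    "label": [2, 2]})

_, existing_map = relabel_from_zero(existing)
new_out, new_map = relabel_from_zero(new)

print(existing_map)  # label 2 stays 2
print(new_map)       # label 2 is collapsed to 0
```

Because the mapping is built only from the labels present in each table, the new annotations end up with label 0 and no longer match label 2 in the existing database.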

To solve this issue and also make the standardize function a lot more flexible, I propose one of the following:

1. Simply remove these two arguments. The labels can be handled later with the output_transform_function() to map them to integers. We would just have to write proper documentation on how to use that function.
2. Remove the start_label_at_1 argument, and change the label argument to behave as follows:
   1. label=None: do nothing. This gives the user the freedom to change the labels manually, if they wish, before calling standardize.
   2. label=dict: the user passes a dictionary that maps the labels. This gives the user greater freedom in managing the labels.
   3. label=list: the current functionality, where the labels in the list are mapped to 0, 1, 2, etc.
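A minimal sketch of how the label handling in the second option could look, assuming the table has a label column. The helper name apply_labels is hypothetical and only illustrates the three cases; it is not existing ketos code.

```python
import pandas as pd

def apply_labels(df, label=None):
    """Hypothetical helper illustrating the proposed label argument."""
    if label is None:
        # Case 1: leave the labels untouched.
        return df.copy()
    if isinstance(label, dict):
        # Case 2: the user supplies the mapping explicitly.
        mapping = label
    elif isinstance(label, (list, tuple)):
        # Case 3: position in the list becomes the integer label.
        mapping = {old: new for new, old in enumerate(label)}
    else:
        raise TypeError("label must be None, a dict or a list")
    out = df.copy()
    # Labels missing from the mapping become NaN with Series.map.
    out["label"] = out["label"].map(mapping)
    return out

new = pd.DataFrame({"filename": ["d.wav", "e.wav"], "label": [2, 2]})

untouched = apply_labels(new)               # labels stay [2, 2]
kept = apply_labels(new, label={2: 2})      # explicit mapping keeps label 2
listed = apply_labels(
    pd.DataFrame({"label": ["boat", "whale", "boat"]}),
    label=["whale", "boat"],                # whale -> 0, boat -> 1
)
```

With the dictionary form, the case of adding data that only contains label 2 to an existing database is handled without any manual relabelling.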

I would favour this second option. What do you guys think?
