This repo contains the code for the RADAr model. RADAr is a sequence-to-sequence model for hierarchical text classification.
Here is a brief description of the functionality of the different modules in the repo.
- We use the same data splits and preprocessing as HBGL. So first, the preprocessing done for HBGL must be run on each dataset.
- The prepare_data.py module from the data_preparation directory should then be used. It splits each sample into text and label and creates six files: train.src, train.tgt, val.src, val.tgt, test.src, and test.tgt. The .src files contain the text, while the .tgt files contain the labels.
- To get the data statistics, the datasets_statistics notebook can be used.
- The organize_labels_level_wise.py module separates the labels level-wise.
- The organize_labels_path_wise.py module organizes the labels into separate paths.
- The sweep_code is used for hyperparameter optimization.
- The T5_BART_exps.py module is used to get the results using the T5 and BART models.
- The "analyze_resutls" notebook is used to analyze the model results.
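
The .src/.tgt layout produced by the data preparation step can be read back with a few lines of Python. This is a minimal sketch, not code from the repo: the helper name `read_parallel` is hypothetical, and it assumes one sample per line, with line i of a .src file matching line i of the corresponding .tgt file.

```python
from pathlib import Path


def read_parallel(data_dir, split):
    """Return a list of (text, labels) pairs for one split, e.g. "train".

    Assumption (not confirmed by the repo): each file holds one sample per
    line, and <split>.src and <split>.tgt are aligned line by line.
    """
    data_dir = Path(data_dir)
    texts = (data_dir / f"{split}.src").read_text(encoding="utf-8").splitlines()
    labels = (data_dir / f"{split}.tgt").read_text(encoding="utf-8").splitlines()

    # A length mismatch means the two files are not aligned.
    if len(texts) != len(labels):
        raise ValueError(
            f"{split}.src has {len(texts)} lines but {split}.tgt has {len(labels)}"
        )
    return list(zip(texts, labels))
```

Calling `read_parallel("path/to/data", "val")` would then give the validation pairs, which is a quick way to check that preprocessing produced aligned files.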
To run the model:
- First, prepare the data using the data preparation modules.
- Specify the data, log, model, and results directories and other parameters in the YAML file corresponding to each dataset.
- To train and test the model, use the command: python main.py xxx
where "xxx" is a three-letter acronym for the dataset: wos, nyt, or rcv, indicating the WOS, NYT, and RCV1-V2 datasets respectively.
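
To run all three datasets from a script, the command can be built programmatically. The sketch below is illustrative only: `build_command` and the `DATASETS` table are hypothetical helpers, and the only things taken from this README are the `main.py` entry point and the three acronyms.

```python
import sys

# Acronyms accepted by main.py according to this README, with the dataset each one names.
DATASETS = {"wos": "WOS", "nyt": "NYT", "rcv": "RCV1-V2"}


def build_command(dataset):
    """Return the argument list for training and testing on one dataset."""
    key = dataset.lower()
    if key not in DATASETS:
        raise ValueError(
            f"unknown dataset {dataset!r}; expected one of {sorted(DATASETS)}"
        )
    # sys.executable is the Python interpreter currently running this script.
    return [sys.executable, "main.py", key]
```

The returned list can be passed to `subprocess.run(build_command("wos"), check=True)` from the repo root, which is equivalent to typing `python main.py wos` with the current interpreter.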