
[R-package] Weighted Training - Different results (cross entropy) when using .weight column Vs inputting the expanded data (repeated rows = .weight times) #5626


Closed
AGLilly opened this issue Dec 8, 2022 · 6 comments

AGLilly commented Dec 8, 2022

Problem:

I have a binomial prediction problem. One characteristic of my data is that many rows share exactly the same dependent and independent variable values. I am trying to speed up training by using weighted learning: I aggregate the identical rows (rolling up on all columns) and use the count of collapsed rows as the weight. However, when I compare the results of the two approaches, I do not get the same results in terms of cross entropy or out-of-sample error on a held-out dataset.

Data Formats

Weighted training data format

| weight | entity | outcome | covariates |
|--------|--------|---------|------------|
| 2      | E1     | 0       | EC1        |
| 3      | E2     | 1       | EC2        |

Expanded training data format

| entity | outcome | covariates |
|--------|---------|------------|
| E1     | 0       | EC1        |
| E1     | 0       | EC1        |
| E2     | 1       | EC2        |
| E2     | 1       | EC2        |
| E2     | 1       | EC2        |

Outcome = 0 -> Failure
Outcome = 1 -> Success

The number of rows in the expanded data equals the sum of the weight column in the weighted training data. A sketch of the roll-up is shown below.
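For clarity, the roll-up I use to go from the expanded format to the weighted format looks like this; a minimal sketch, with `expanded_data` standing in for my actual training frame:

```r
library(dplyr)

## Collapse duplicate rows into one row per unique
## (entity, outcome, covariates) combination; the number of
## collapsed rows becomes the weight.
weighted_data <- expanded_data %>%
  count(entity, outcome, covariates, name = "weight")
```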

###### Expanded data format LGB run ######

```r
library(lightgbm)
library(dplyr)

Input_data <- ...   # placeholder: TRAINING DATASET (expanded data format above)

## Create the LGB Dataset for the expanded data format
outcome_name <- "output"
weight_name <- "weight"

Input_data_mat <- Input_data %>% select(-all_of(outcome_name))

## create the lgb data without the weight variable
lgb_data <-
  lgb.Dataset(
    data = Input_data_mat %>%
      as.matrix(),
    label = Input_data %>%
      pull(!!rlang::sym(outcome_name))
  )

params <- list(objective = "cross_entropy", max_bin = 5000,
               learning_rate = 0.01, max_depth = -1, num_leaves = 31)

## Cross-validate with exactly the same params and no weight variable
lgb_mod <- lgb.cv(
  params = params,
  data = lgb_data,
  nrounds = 10000,
  early_stopping_rounds = 5
)

## Compare the best iteration and best score
print(glue::glue("The best iteration is: {lgb_mod$best_iter} & best CE: {lgb_mod$best_score}"))

## Train the lgb model based on the cross-validated parameters
lgb_trained_model <- lgb.train(params = params,
                               data = lgb_data,
                               nrounds = lgb_mod$best_iter)

## Generate OOS predictions on the test data
test_data <- ...   # placeholder: TEST DATA for predictions (again in the expanded format)

test_matrix <- test_data %>%
  select(-all_of(outcome_name)) %>%
  as.matrix()

## Make predictions on the above dataset
predictions <- predict(lgb_trained_model, test_matrix)
```

###### Weighted training scheme ######

```r
Input_data <- ...   # placeholder: TRAINING DATASET (weighted training data format above)

## Create the LGB Dataset for the weighted data format
outcome_name <- "output"
weight_name <- ".weight"

## Prepare the model matrix for input to lgb.Dataset
Input_data_mat <- Input_data %>%
  select(-all_of(outcome_name),
         -all_of(weight_name))

## create the lgb data with the weight variable
lgb_data <-
  lgb.Dataset(
    data = Input_data_mat %>%
      as.matrix(),
    label = Input_data %>%
      pull(!!rlang::sym(outcome_name)),
    weight = Input_data %>%
      pull(!!rlang::sym(weight_name))
  )

params <- list(objective = "cross_entropy", max_bin = 5000,
               learning_rate = 0.01, max_depth = -1, num_leaves = 31)

## Cross-validate with exactly the same params, this time with the .weight variable
lgb_mod <- lgb.cv(
  params = params,
  data = lgb_data,
  nrounds = 10000,
  early_stopping_rounds = 5
)

## Compare the best iteration and best score
print(glue::glue("The best iteration is: {lgb_mod$best_iter} & best CE: {lgb_mod$best_score}"))

## Train the lgb model based on the cross-validated parameters
lgb_trained_model <- lgb.train(params = params,
                               data = lgb_data,
                               nrounds = lgb_mod$best_iter)

## Generate OOS predictions on the test data
test_data <- ...   # placeholder: TEST DATA for predictions (again in the weighted format)

## Create the prediction matrix; remove outcome and weight variables
test_data_pred <- test_data %>%
  select(-all_of(outcome_name),
         -all_of(weight_name)) %>%
  as.matrix()

## Make predictions on the above dataset
predictions <- predict(lgb_trained_model, test_data_pred)

## Put the weights and predictions in the same frame
output <- data.frame(weight = test_data$.weight, predictions = predictions)

## Weight each unique row's prediction by its count
output$total <- output$weight * output$predictions

## Get the total predicted outcome over the expanded rows
print(glue::glue("Total predicted output: {sum(output$total)}"))
```

When I compare the cross entropy of the two model runs, as well as their out-of-sample predictions, they are very different: the OOS predictions differ by about 50% on average.

Questions

  1. Am I doing this correctly?
  2. Is there an example in R that I can replicate?
@jameslamb jameslamb changed the title Weighted Training - Different results (cross entropy) when using .weight column Vs inputting the expanded data (repeated rows = .weight times) [R-package] Weighted Training - Different results (cross entropy) when using .weight column Vs inputting the expanded data (repeated rows = .weight times) Dec 8, 2022
mayer79 (Contributor) commented Dec 18, 2022

Can you please add three backticks before and after the code for proper formatting?

You will need to remove all regularizations like min_sum_hessian etc. to have a chance that the results match; not all of them default to 0. Interesting question, though.
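Something like this, as an untested sketch (parameter names are from the LightGBM docs; the values are only illustrative):

```r
## Sketch: relax the count- and sum-based constraints that react to
## row duplication. min_data_in_leaf defaults to 20 and
## min_sum_hessian_in_leaf to 1e-3, so they are not 0 out of the box.
params <- list(
  objective = "cross_entropy",
  min_data_in_leaf = 0,
  min_sum_hessian_in_leaf = 0,
  min_data_in_bin = 1
)
```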

jameslamb (Collaborator) commented

Thanks for using LightGBM. Sorry it took so long for someone to respond to you here.

I've reformatted your question to make the difference between code, output from code, and your own words clearer. If you're not familiar with how to do that in markdown, please see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax.

jameslamb (Collaborator) commented

To @mayer79 's point... there are many parameters in LightGBM whose impact on the final model is sensitive to the number of rows in the training data, for example:

Parameters evaluated against a sum of row-wise values:

  • min_sum_hessian_in_leaf
  • min_gain_to_split

Parameters evaluated against a count of rows:

  • min_data_in_leaf
  • min_data_in_bin
  • min_data_per_group (for categorical features)

Duplicating rows changes those sums and counts. For example, imagine a dataset with 0 duplicates where you train with min_data_in_leaf = 20 (the default). LightGBM might avoid severe overfitting because it will not add splits that result in a leaf having fewer than 20 cases. Now imagine that you duplicated every row in the dataset 20 times, and retrained without changing the parameters. LightGBM might happily add splits that produced leaves which only matched 20 copies of the same data... effectively memorizing a single specific row in the training data! That'd hurt the generalizability of the trained model.
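To make that concrete, here is a rough sketch (which I have not run) of how the two schemes could be compared on synthetic data once the row-count and hessian-sum constraints are disabled; everything here is made up for illustration:

```r
library(lightgbm)

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
y <- as.integer(X[, 1] + rnorm(100) > 0)

## Expanded data: every row repeated 3 times.
idx <- rep(seq_len(nrow(X)), each = 3)
d_expanded <- lgb.Dataset(X[idx, ], label = y[idx])

## Weighted data: one copy of each row, weight = 3.
d_weighted <- lgb.Dataset(X, label = y, weight = rep(3, nrow(X)))

## Disable the constraints that count rows or sum hessians.
params <- list(
  objective = "cross_entropy",
  min_data_in_leaf = 0,
  min_sum_hessian_in_leaf = 0,
  min_data_in_bin = 1,
  verbose = -1
)

m_expanded <- lgb.train(params, d_expanded, nrounds = 10)
m_weighted <- lgb.train(params, d_weighted, nrounds = 10)

## With those constraints disabled, the two models' predictions
## should agree closely.
max(abs(predict(m_expanded, X) - predict(m_weighted, X)))
```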

You can learn about these parameters at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

I recommend just proceeding with the approach you described: eliminate the identical rows by preserving only one row for each unique combination of ([features], target), and use the count (or, even better, the share of total rows) as the weight. That will result in less memory usage and faster training, and you should still be able to achieve good performance.
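Building on the hypothetical roll-up sketch in the question (with `weighted_data` and its `weight` column standing in for your real names), the share-of-total version would be something like:

```r
## Sketch: replace raw counts with each unique row's share of all rows.
weighted_data <- weighted_data %>%
  mutate(weight = weight / sum(weight))
```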


NOTE: I did not run your example code, as it's quite large and (crucially) doesn't include the actual training data. If you're able to provide a minimal, reproducible example (docs on what that is) showing the performance difference, I'd be happy to investigate further. Please note the word "minimal" there especially... is it really necessary to train for 10,000 rounds, with 5000 bins per feature, to demonstrate the behavior you're asking about? Is it really necessary to use {glue} just for formatting a print() statement, instead of sprintf() or paste0()? If you could cut down the example to something smaller and fully reproducible, it'd reduce the effort required for us to help you.

mayer79 (Contributor) commented Sep 10, 2023

@jameslamb: Fantastic explanation, thanks!

github-actions (bot) commented

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

github-actions (bot) commented

This issue has been automatically locked since there has not been any recent activity after it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this one.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 16, 2024