[R-package] Weighted Training - Different results (cross entropy) when using .weight column Vs inputting the expanded data (repeated rows = .weight times) #5626
Can you please add three backticks before and after the code for proper formatting? You will also need to remove all regularization parameters for the two runs to be comparable.
Thanks for using LightGBM. Sorry it took so long for someone to respond to you here. I've reformatted your question to make the difference between code, output from code, and your own words clearer. If you're not familiar with how to do that in markdown, please see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax.
To @mayer79's point... there are many parameters in LightGBM whose impact on the final model is sensitive to the number of rows in the training data, for example:

- Parameters evaluated against a sum of row-wise values (e.g., `min_sum_hessian_in_leaf`)
- Parameters evaluated against a count of rows (e.g., `min_data_in_leaf`, `min_data_in_bin`)
Duplicating rows changes those sums and counts. For example, with a count-based constraint like `min_data_in_leaf`, a split that would be rejected on the de-duplicated data can become acceptable once rows are repeated, because the same leaf then contains many more rows. You can learn about these parameters at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

I recommend just proceeding with the approach you described... eliminating identical rows by instead preserving only one row for each unique combination of feature values and label, with the count of duplicates as the weight.

NOTE: I did not run your example code, as it's quite large and (crucially) doesn't include the actual training data. If you're able to provide a minimal, reproducible example (docs on what that is) showing the performance difference, I'd be happy to investigate further. Please note the word "minimal" there especially... is it really necessary to train for 10,000 rounds, with 5000 bins per feature, to demonstrate the behavior you're asking about?
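To make the point above concrete, here is a minimal R sketch (not from the original thread; the simulated data, column names, and parameter values are all illustrative assumptions). It trains one model on a rolled-up dataset with counts as weights and another on the expanded rows; count-based parameters such as `min_data_in_leaf` see very different row counts in the two setups, so the fitted trees can differ even though the weighted sums of gradients and hessians match.

```r
library(lightgbm)
library(data.table)

# Simulate a dataset with many exact duplicate rows
set.seed(42)
n <- 10000L
dt <- data.table(
  x1 = sample(1:20, n, replace = TRUE),
  x2 = sample(0:1, n, replace = TRUE)
)
dt[, y := as.integer(runif(n) < plogis(0.25 * x1 - 2.5 - x2))]

# Rolled-up ("weighted") format: one row per unique (x1, x2, y), count as weight
agg <- dt[, .(w = .N), by = .(x1, x2, y)]
dtrain_weighted <- lgb.Dataset(
  data   = as.matrix(agg[, .(x1, x2)]),
  label  = agg$y,
  weight = agg$w
)

# Expanded format: the original duplicated rows, implicit unit weights
dtrain_expanded <- lgb.Dataset(
  data  = as.matrix(dt[, .(x1, x2)]),
  label = dt$y
)

params <- list(
  objective = "binary",
  learning_rate = 0.1,
  # Count-based constraints see nrow(agg) rows in the weighted setup but
  # nrow(dt) rows in the expanded one, so the trees that get built can differ
  min_data_in_leaf = 20L
)

bst_weighted <- lgb.train(params, dtrain_weighted, nrounds = 50L)
bst_expanded <- lgb.train(params, dtrain_expanded, nrounds = 50L)
```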
@jameslamb: Fantastic explanation, thanks!
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Problem:
I have a binomial prediction problem. One characteristic of my data is that many rows share the same dependent and independent values. I am trying to speed up training by using weighted learning: I aggregate (roll up on all columns) identical rows and use the count of rows as the weight. However, when I compare the results of the two exercises, I do not get the same results in terms of cross entropy or out-of-sample error on a separate dataset.
Data Formats
Weighted Training data format
Expanded Data training format
The number of rows in the expanded data is equal to the sum of the weight variable in the weighted training data.
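A minimal sketch of the roll-up described above (the column names `y`, `x1`, `x2` and the toy values are illustrative assumptions, not the actual data):

```r
library(data.table)

# Toy "expanded" data containing duplicated rows
expanded <- data.table(
  y  = c(1L, 1L, 0L, 0L, 0L, 1L),
  x1 = c(2,   2,  2,  3,  3,  2),
  x2 = c("a", "a", "a", "b", "b", "a")
)

# Roll up on all columns; the count of duplicates becomes the .weight column
weighted <- expanded[, .(.weight = .N), by = .(y, x1, x2)]

# Sanity check: the weights must add back up to the expanded row count
stopifnot(sum(weighted$.weight) == nrow(expanded))
```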
Expanded data format LGB run
When I compare the cross entropy of the two model runs, as well as the out-of-sample predictions, they are very different. OOS predictions differ by about 50% on average.
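For reference, a hedged sketch of the comparison described above, assuming two fitted boosters `bst_weighted` and `bst_expanded` (for example, from the sketch earlier in the thread) and a held-out feature matrix `x_oos` with labels `y_oos`; these names are illustrative, not from the original post:

```r
# Binary cross entropy (log loss); eps guards against log(0)
cross_entropy <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

p_weighted <- predict(bst_weighted, x_oos)
p_expanded <- predict(bst_expanded, x_oos)

c(
  weighted = cross_entropy(y_oos, p_weighted),
  expanded = cross_entropy(y_oos, p_expanded)
)

# Average relative difference between the two sets of OOS predictions
mean(abs(p_weighted - p_expanded) / pmax(p_expanded, 1e-15))
```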
Questions