[python-package] ignore scikit-learn 'check_sample_weight_equivalence' estimator check (fixes #6678) #6679
Fixes #6678
Fixes #6680
As of scikit-learn/scikit-learn#29818 (which will be in `scikit-learn` 1.6), `scikit-learn` contains an estimator check, `check_sample_weight_equivalence`, that enforces the following behavior: fitting with `sample_weight` must be equivalent to fitting on a dataset where each row has been dropped (`weight=0`) or repeated (`int(weight) >= 2`) according to its weight.

This new check is breaking CI here... LightGBM does not work that way (a minimal sketch follows the list below):
- `weight=0.0` samples' feature values still contribute to feature distributions, and therefore to bin boundaries in `Dataset` histograms ("How observations with sample_weight of zero influence the fit of LGBMRegressor", #5553)
- every `weight=0.0` sample still counts as 1 sample from the perspective of count-based hyper-parameters like `min_data_in_leaf` ("[R-package] Weighted Training - Different results (cross entropy) when using .weight column Vs inputting the expanded data (repeated rows = .weight times)", #5626 (comment))
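The following sketch (synthetic data, hypothetical parameter values) illustrates the equivalence the check expects, and why LightGBM's `Dataset` binning generally violates it:

```python
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(seed=708)
X = rng.uniform(size=(100, 3))
y = rng.uniform(size=(100,))
# integer weights: 0 means "ignore this row", 2 means "count it twice"
w = rng.integers(low=0, high=3, size=100)

# model 1: fit with sample_weight
reg_weighted = LGBMRegressor(n_estimators=5, min_child_samples=5).fit(X, y, sample_weight=w)

# model 2: fit on data physically expanded according to the weights
X_rep, y_rep = np.repeat(X, w, axis=0), np.repeat(y, w, axis=0)
reg_expanded = LGBMRegressor(n_estimators=5, min_child_samples=5).fit(X_rep, y_rep)

# the check requires these predictions to match; for LightGBM they
# generally do not, because weight=0 rows still affect bin boundaries
# and row counts during Dataset construction
print(np.allclose(reg_weighted.predict(X), reg_expanded.predict(X)))
```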
This PR proposes skipping that estimator check, just as `scikit-learn` is currently doing for `HistGradientBoosting*` estimators (code link).
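For context, a hedged sketch of the skip mechanism: in scikit-learn's long-standing (pre-1.6) tag system, an estimator can declare a check as an expected failure through the `_xfail_checks` entry returned by `_more_tags()`. The `SketchRegressor` class below is hypothetical and only illustrates the mechanism; it is not the diff in this PR.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class SketchRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical estimator, used only to illustrate the tag mechanism."""

    def fit(self, X, y, sample_weight=None):
        self.is_fitted_ = True
        return self

    def predict(self, X):
        return np.zeros(len(X))

    def _more_tags(self):
        # scikit-learn's estimator checks read this mapping and treat
        # the named checks as expected failures instead of hard errors
        return {
            "_xfail_checks": {
                "check_sample_weight_equivalence": (
                    "zero-weight rows still influence Dataset bin "
                    "boundaries and row counts in LightGBM"
                ),
            },
        }
```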
## Notes for Reviewers

We could modify LightGBM's `scikit-learn` estimators to match this expected behavior by excluding rows with `weight=0` and creating copies of rows with `int(weight) >= 2` before passing the data through to `Dataset` here:

`LightGBM/python-package/lightgbm/sklearn.py`, lines 938 to 947 in `bbeecc0`
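A minimal numpy sketch of that hypothetical preprocessing (the helper name and data are made up for illustration):

```python
import numpy as np

def expand_by_weight(X, y, sample_weight):
    # int(weight) repeats per row: weight=0 drops the row,
    # int(weight) >= 2 duplicates it; fractional parts are truncated
    repeats = np.asarray(sample_weight).astype(int)
    return np.repeat(X, repeats, axis=0), np.repeat(y, repeats, axis=0)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0])
w = np.array([0.0, 1.0, 2.5])

X_new, y_new = expand_by_weight(X, y, w)
print(X_new.ravel())  # [2. 3. 3.] -- row 0 dropped, row 2 duplicated
```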
I don't support doing that... I think it'd add significant complexity (do count-based hyper-parameters need to be modified? how does this affect Dask estimators?) for questionable benefit.
I think just skipping this compliance check is the right thing for `lightgbm` right now.