
[c++] Segmentation fault when using a parallel tree learner and one feature has more than 28 categories (default max_cat_threshold of 32, minus 4). #6491


Closed
moming39 opened this issue Jun 19, 2024 · 4 comments

@moming39
Contributor

moming39 commented Jun 19, 2024

return 2 * sizeof(int) + sizeof(uint32_t) + sizeof(bool) + sizeof(double) * 7 + sizeof(data_size_t) * 2 + max_cat_threshold * sizeof(uint32_t) + sizeof(int8_t);

This code computes the size of the buffer used to communicate split info in distributed training. However, it omits the size of the two most recently added members (right_sum_gradient_and_hessian and left_sum_gradient_and_hessian).
When a parallel tree learner is used and one feature has more than 28 categories (the default max_cat_threshold of 32, minus 2*2, since each of the two missing int64 values takes up two uint32 slots), this results in a segmentation fault.

The code should be:

  inline static int Size(int max_cat_threshold) {
    return 2 * sizeof(int) + sizeof(uint32_t) + sizeof(bool) + sizeof(double) * 7 + sizeof(data_size_t) * 2 + max_cat_threshold * sizeof(uint32_t) + sizeof(int8_t) + sizeof(int64_t)*2;
  }
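
The arithmetic behind the "28 categories" figure can be checked with a small sketch. This is only an illustration in Python, not LightGBM code: it assumes data_size_t is a 32-bit integer and that max_cat_threshold defaults to 32, and the helper names (old_size, fixed_size, max_entries_that_fit) are hypothetical.

```python
import ctypes

# Field widths copied from the Size() expression quoted in this issue.
# Assumption: data_size_t is a 32-bit integer.
FIXED_FIELDS = (
    2 * ctypes.sizeof(ctypes.c_int)
    + ctypes.sizeof(ctypes.c_uint32)
    + ctypes.sizeof(ctypes.c_bool)
    + 7 * ctypes.sizeof(ctypes.c_double)
    + 2 * ctypes.sizeof(ctypes.c_int32)  # data_size_t (assumed int32)
    + ctypes.sizeof(ctypes.c_int8)
)
INT64_SUMS = 2 * ctypes.sizeof(ctypes.c_int64)  # the two members left out
SLOT = ctypes.sizeof(ctypes.c_uint32)           # one categorical threshold entry


def old_size(max_cat_threshold):
    """Buffer size as computed before the fix."""
    return FIXED_FIELDS + max_cat_threshold * SLOT


def fixed_size(max_cat_threshold):
    """Buffer size with the two int64 sums included."""
    return old_size(max_cat_threshold) + INT64_SUMS


def max_entries_that_fit(max_cat_threshold):
    """Threshold entries that still fit in the old buffer once the
    two int64 sums are written as well."""
    return (old_size(max_cat_threshold) - FIXED_FIELDS - INT64_SUMS) // SLOT


DEFAULT_MAX_CAT_THRESHOLD = 32  # assumed default value
shortfall = fixed_size(DEFAULT_MAX_CAT_THRESHOLD) - old_size(DEFAULT_MAX_CAT_THRESHOLD)
print("bytes missing from the old buffer:", shortfall)
print("entries that fit in the old buffer:", max_entries_that_fit(DEFAULT_MAX_CAT_THRESHOLD))
```

Under these assumptions the old buffer is 16 bytes short, which is the space of four uint32 entries, so writes would run past the end of the buffer once more than 28 entries are serialized.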
@jameslamb jameslamb added the bug label Jun 20, 2024
@MelleVessies

MelleVessies commented Dec 6, 2024

I'm also hitting this issue when using lightgbm ^4.0.0 with a Ray cluster: the worker crashes with a segfault in DataParallelTreeLearner. I ended up changing split_info.hpp as described above and building from source, which fixes the issue, but it would be great if this could be fixed in master. Below is a minimal example that reproduces the issue.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import ray
from ray.train.lightgbm import LightGBMTrainer
from ray.air.config import ScalingConfig

def generate_synthetic_data(n_rows=10000, n_unique_categories=100):
    np.random.seed(42)    
    print(n_rows)
    table = pa.table({
        "var1": np.concatenate((np.arange(n_unique_categories), np.random.randint(0, n_unique_categories, size=n_rows - n_unique_categories))),
        "label": np.random.randint(0, 2, size=n_rows)
    })
    
    return ray.data.from_arrow(table)

# Load training data
train_set = generate_synthetic_data()

# Train LightGBM
trainer = LightGBMTrainer(
    label_column="label",
    scaling_config=ScalingConfig(num_workers=2),
    params={"categorical_feature": 0, "tree_learner": "data", "verbosity": 2},
    datasets={"train": train_set},
)
result = trainer.fit()
print(result.metrics)

This makes the Ray worker crash with the following log. Interestingly, a warning is printed saying that categorical_feature is ignored, yet it still seems to trigger the crash.

:job_id:01000000
:actor_name:RayTrainWorker
/home/melle/miniconda3/envs/thor/lib/python3.11/site-packages/lightgbm/basic.py:2034: UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
Please use categorical_feature argument of the Dataset constructor to pass this parameter.
  _log_warning(f'{key} keyword has been found in `params` and will be ignored.\n'
*** SIGSEGV received at time=1733493852 on cpu 3 ***
PC: @     0x7fee9b91f274  (unknown)  LightGBM::SerialTreeLearner::SplitInner()
    @     0x7ff1867fe520       1712  (unknown)
    @     0x7fee9b87f826  (unknown)  LightGBM::DataParallelTreeLearner<>::Split()
    @        0x200001388  129571456  (unknown)
    @     0x7fee9b87fca0  (unknown)  (unknown)
    @ 0x4810c08348fb8948  (unknown)  (unknown)
[2024-12-06 15:04:12,161 E 570227 570348] logging.cc:440: *** SIGSEGV received at time=1733493852 on cpu 3 ***
[2024-12-06 15:04:12,161 E 570227 570348] logging.cc:440: PC: @     0x7fee9b91f274  (unknown)  LightGBM::SerialTreeLearner::SplitInner()
[2024-12-06 15:04:12,162 E 570227 570348] logging.cc:440:     @     0x7ff1867fe520       1712  (unknown)
[2024-12-06 15:04:12,162 E 570227 570348] logging.cc:440:     @     0x7fee9b87f826  (unknown)  LightGBM::DataParallelTreeLearner<>::Split()
[2024-12-06 15:04:12,163 E 570227 570348] logging.cc:440:     @        0x200001388  129571456  (unknown)
[2024-12-06 15:04:12,164 E 570227 570348] logging.cc:440:     @     0x7fee9b87fca0  (unknown)  (unknown)
[2024-12-06 15:04:12,165 E 570227 570348] logging.cc:440:     @ 0x4810c08348fb8948  (unknown)  (unknown)
Fatal Python error: Segmentation fault

Stack (most recent call first):
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/site-packages/lightgbm/basic.py", line 3891 in update
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/site-packages/lightgbm/engine.py", line 276 in train
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/site-packages/ray/train/lightgbm/lightgbm_trainer.py", line 65 in _lightgbm_train_fn_per_worker
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 176 in discard_return_wrapper
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 197 in train_fn
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/site-packages/ray/air/_internal/util.py", line 104 in run
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/threading.py", line 1002 in _bootstrap

Extension modules: psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, yaml._yaml, _brotli, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, ray._raylet, lz4._version, lz4.frame._frame, zstandard.backend_c, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pyarrow._parquet, cython.cimports.libc.math, pyarrow._json, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, 
scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, PIL._imaging, kiwisolver._cext, sklearn.__check_build._check_build, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, 
scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._ansari_swilk_statistics, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.stats._unuran.unuran_wrapper, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, sklearn.utils._random, _cffi_backend (total: 202)

@moming39
Contributor Author

moming39 commented Dec 6, 2024

Yes, this error is caused by the code above (#6491 (comment)). The problem was introduced in lightgbm>=4.0.

@jameslamb
Collaborator

@moming39 you seem to have a clear idea of where the bug is and what caused it. Would you like to submit a pull request with a fix?

moming39 added a commit to moming39/LightGBM that referenced this issue Dec 7, 2024
microsoft#6491 (comment)
@jameslamb
Collaborator

Closing this based on the fix in #6738 and @shiyu1994 's approval there. This fix will go out in the next release of LightGBM.

If you build the latest development version and find that it is still not fixed, please comment and we can re-open it.
