[c++] Segmentation fault when using the parallel_tree_learner method with a feature that has more than 28 categories (default max_cat_threshold - 4). #6491

moming39 · 2024-06-19T12:25:20Z

Line 56 in 5cd95a5

    
    return 2 * sizeof(int) + sizeof(uint32_t) + sizeof(bool) + sizeof(double) * 7 + sizeof(data_size_t) * 2 + max_cat_threshold * sizeof(uint32_t) + sizeof(int8_t);

This code computes the size of the buffer used to communicate the split info in distributed training. However, it does not account for the two most recently added members, left_sum_gradient_and_hessian and right_sum_gradient_and_hessian. These are two int64_t values, 16 bytes in total, which is the space of four uint32_t category thresholds.
When the parallel_tree_learner method is used and one feature has more than 28 categories (the default max_cat_threshold of 32, minus 2*2), this results in a segmentation fault.

The code should be:

  inline static int Size(int max_cat_threshold) {
    return 2 * sizeof(int) + sizeof(uint32_t) + sizeof(bool) + sizeof(double) * 7 + sizeof(data_size_t) * 2 + max_cat_threshold * sizeof(uint32_t) + sizeof(int8_t) + sizeof(int64_t)*2;
  }
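The byte arithmetic behind the "28" can be checked with a small Python sketch using `ctypes`. It assumes that `data_size_t` is a 32-bit integer (this thread does not state it) and that the two missing fields are `int64_t`, as the corrected `Size` above implies. The helper names `old_size`, `fixed_size`, and `categories_that_fit` are made up for illustration and are not part of LightGBM.

```python
import ctypes

# Assumption: data_size_t is a 32-bit signed integer; the thread does not say so.
DATA_SIZE_T = ctypes.c_int32


def old_size(max_cat_threshold: int) -> int:
    """Bytes reserved by the original Size() expression quoted in the issue."""
    return (
        2 * ctypes.sizeof(ctypes.c_int)
        + ctypes.sizeof(ctypes.c_uint32)
        + ctypes.sizeof(ctypes.c_bool)
        + 7 * ctypes.sizeof(ctypes.c_double)
        + 2 * ctypes.sizeof(DATA_SIZE_T)
        + max_cat_threshold * ctypes.sizeof(ctypes.c_uint32)
        + ctypes.sizeof(ctypes.c_int8)
    )


def fixed_size(max_cat_threshold: int) -> int:
    """Bytes reserved once the two int64_t fields are included."""
    return old_size(max_cat_threshold) + 2 * ctypes.sizeof(ctypes.c_int64)


def categories_that_fit(max_cat_threshold: int) -> int:
    """Category slots left in the old buffer after the two extra fields."""
    missing_bytes = fixed_size(max_cat_threshold) - old_size(max_cat_threshold)
    missing_slots = missing_bytes // ctypes.sizeof(ctypes.c_uint32)
    return max_cat_threshold - missing_slots


if __name__ == "__main__":
    print("missing bytes:", fixed_size(32) - old_size(32))
    print("category slots that still fit:", categories_that_fit(32))
```

This only reproduces the size calculation; it does not model how LightGBM actually lays out or copies the fields. Under the stated assumptions the old buffer is 16 bytes short, which leaves room for 28 of the 32 category thresholds and matches the number quoted in the issue.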

MelleVessies · 2024-12-06T14:21:39Z

I'm also having this issue while trying to use lightgbm ^4.0.0 with a Ray cluster: the worker crashes with a segfault in DataParallelTreeLearner. I ended up changing split_info.hpp as described above and building from source, which fixes the issue, but it would be great if this could be fixed in master. I tried to write a minimal example to reproduce the issue:

import numpy as np
import pyarrow as pa
import ray
from ray.train.lightgbm import LightGBMTrainer
from ray.air.config import ScalingConfig

def generate_synthetic_data(n_rows=10000, n_unique_categories=100):
    np.random.seed(42)
    print(n_rows)
    table = pa.table({
        # The leading arange guarantees that every category value appears at least once.
        "var1": np.concatenate((np.arange(n_unique_categories), np.random.randint(0, n_unique_categories, size=n_rows - n_unique_categories))),
        "label": np.random.randint(0, 2, size=n_rows)
    })

    return ray.data.from_arrow(table)

# Load training data
train_set = generate_synthetic_data()

# Train LightGBM with the data-parallel tree learner on two workers
trainer = LightGBMTrainer(
    label_column="label",
    scaling_config=ScalingConfig(num_workers=2),
    params={"categorical_feature": 0, "tree_learner": "data", "verbosity": 2},
    datasets={"train": train_set},
)
result = trainer.fit()
print(result.metrics)

This makes the Ray worker crash with the following log. Interestingly, a warning is printed saying that categorical_feature is ignored, yet it still seems to cause the issue.

:job_id:01000000
:actor_name:RayTrainWorker
/home/melle/miniconda3/envs/thor/lib/python3.11/site-packages/lightgbm/basic.py:2034: UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
Please use categorical_feature argument of the Dataset constructor to pass this parameter.
  _log_warning(f'{key} keyword has been found in `params` and will be ignored.\n'
*** SIGSEGV received at time=1733493852 on cpu 3 ***
PC: @     0x7fee9b91f274  (unknown)  LightGBM::SerialTreeLearner::SplitInner()
    @     0x7ff1867fe520       1712  (unknown)
    @     0x7fee9b87f826  (unknown)  LightGBM::DataParallelTreeLearner<>::Split()
    @        0x200001388  129571456  (unknown)
    @     0x7fee9b87fca0  (unknown)  (unknown)
    @ 0x4810c08348fb8948  (unknown)  (unknown)
[2024-12-06 15:04:12,161 E 570227 570348] logging.cc:440: *** SIGSEGV received at time=1733493852 on cpu 3 ***
[2024-12-06 15:04:12,161 E 570227 570348] logging.cc:440: PC: @     0x7fee9b91f274  (unknown)  LightGBM::SerialTreeLearner::SplitInner()
[2024-12-06 15:04:12,162 E 570227 570348] logging.cc:440:     @     0x7ff1867fe520       1712  (unknown)
[2024-12-06 15:04:12,162 E 570227 570348] logging.cc:440:     @     0x7fee9b87f826  (unknown)  LightGBM::DataParallelTreeLearner<>::Split()
[2024-12-06 15:04:12,163 E 570227 570348] logging.cc:440:     @        0x200001388  129571456  (unknown)
[2024-12-06 15:04:12,164 E 570227 570348] logging.cc:440:     @     0x7fee9b87fca0  (unknown)  (unknown)
[2024-12-06 15:04:12,165 E 570227 570348] logging.cc:440:     @ 0x4810c08348fb8948  (unknown)  (unknown)
Fatal Python error: Segmentation fault

Stack (most recent call first):
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/site-packages/lightgbm/basic.py", line 3891 in update
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/site-packages/lightgbm/engine.py", line 276 in train
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/site-packages/ray/train/lightgbm/lightgbm_trainer.py", line 65 in _lightgbm_train_fn_per_worker
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 176 in discard_return_wrapper
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 197 in train_fn
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/site-packages/ray/air/_internal/util.py", line 104 in run
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/home/melle/miniconda3/envs/thor/lib/python3.11/threading.py", line 1002 in _bootstrap

Extension modules: psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, yaml._yaml, _brotli, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, ray._raylet, lz4._version, lz4.frame._frame, zstandard.backend_c, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pyarrow._parquet, cython.cimports.libc.math, pyarrow._json, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, 
scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, PIL._imaging, kiwisolver._cext, sklearn.__check_build._check_build, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, 
scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._ansari_swilk_statistics, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.stats._unuran.unuran_wrapper, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, sklearn.utils._random, _cffi_backend (total: 202)
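As a rough check that the synthetic data in the script above falls into the range described in the first comment, the following standard-library sketch mimics how `var1` is built and counts its distinct values. It uses `random` instead of NumPy, so the individual values differ from the original script; the helper names and the `limit=28` default (taken from the issue title) are illustrative assumptions.

```python
import random


def synthetic_categories(n_rows=10000, n_unique_categories=100, seed=42):
    """Mimic the layout of `var1`: one of each category, then random draws."""
    rng = random.Random(seed)
    guaranteed = list(range(n_unique_categories))
    # randrange excludes the upper bound, like np.random.randint.
    random_part = [
        rng.randrange(n_unique_categories)
        for _ in range(n_rows - n_unique_categories)
    ]
    return guaranteed + random_part


def exceeds_reported_limit(values, limit=28):
    """True if the column has more distinct values than the limit in the issue."""
    return len(set(values)) > limit


if __name__ == "__main__":
    values = synthetic_categories()
    print("rows:", len(values))
    print("distinct categories:", len(set(values)))
    print("above the reported limit:", exceeds_reported_limit(values))
```

Because the first `n_unique_categories` entries enumerate every category once, the column always has exactly 100 distinct values regardless of the random draws, which is well above the 28 mentioned in the issue.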

moming39 · 2024-12-06T14:32:13Z

Yeah, this error is caused by the code above (#6491 (comment)). The problem was introduced in lightgbm>=4.0.

jameslamb · 2024-12-06T15:18:40Z

@moming39 you seem to have a clear idea of where the bug is and what caused it. Would you like to submit a pull request with a fix?

jameslamb · 2024-12-11T04:24:57Z

Closing this based on the fix in #6738 and @shiyu1994 's approval there. This fix will go out in the next release of LightGBM.

If you build the latest development version and find that it is still not fixed, please comment and we can re-open it.

jameslamb added the bug label Jun 20, 2024

moming39 mentioned this issue Dec 7, 2024

[c++] fix parallel_tree_learner_split_info #6738

Merged

jameslamb closed this as completed Dec 11, 2024
