Possible unexpected behaviour of gs.vario_estimate with masked fields

Hi,

This is the first time I'm rising a possible issue of this library, so I apologize if this is not the right way to communicate it. I'm also not an expert in the library, and my knowledge field is nowhere near geostatistics, but I have been involved in a project where I'm using the variogram estimation and covarianze model fitting to those variograms a lot, so I have taken a look at the source code now and then. 

When trying to compute an isotropic variogram using a field and a mask to that field (where True values are values to exclude from the passed field), i always get a IndexError, either by passing the mask using the `mask` argument, or making the field using `masked_field = ma.array(data.flatten(), mask=mask.flatten())`. Taking a look at the [source code](https://geostat-framework.readthedocs.io/projects/gstools/en/v1.3.0/_modules/gstools/variogram/variogram.html#vario_estimate) i have noticed a potential issue. 

I think that `pnt_cnt` is calculated before the mask is applied, so it doesn't reflect the reduced size of the field after the mask. This leads to an out-of-bounds error when sampling, since the function doesn't update `pnt_cnt` post-masking. When the user passes a `sampling_size` argument, sampling indices are generated, which refer to the original size, not the masked size. Later, when the indices are used to sample the field (`field = field[:, sampled_idx]`), they exceed the available range of the smaller masked array, resulting in the out-of-bounds error.

I will attach the function that raises this exception for me:

```python
def estimate_variogram_isotropic_3d(
    data: np.ndarray,
    bins: np.ndarray,
    mask: np.ndarray,
    sampling_size: int = 2000,
    sampling_seed: int = 19920516,
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Estimate the isotropic variogram from 3D data using gstools.vario_estimate,
    passing the exclusion mask directly.

    Parameters
    ----------
    data : np.ndarray
        3D array containing the volume data.
    bins : np.ndarray
        1D array defining the bin edges for variogram estimation.
    mask : np.ndarray
        3D boolean exclusion mask (True indicates voxels to exclude).
    sampling_size : int, optional
        Number of voxel pairs to sample.
    sampling_seed : int, optional
        Seed for random sampling.

    Returns
    -------
    bin_centers : np.ndarray
        Centers of the distance bins.
    gamma : np.ndarray
        Estimated variogram values.
    """
    # Generate a grid of coordinates for the full volume.
    x = np.arange(data.shape[0])
    y = np.arange(data.shape[1])
    z = np.arange(data.shape[2])
    pos_x, pos_y, pos_z = np.meshgrid(x, y, z, indexing="ij")
    pos_all = np.vstack(
        (
            pos_x.flatten(),
            pos_y.flatten(),
            pos_z.flatten(),
        )
    )

    # Flatten the field and the exclusion mask.
    field = data.flatten()
    mask_flat = mask.flatten()

    print(f"Total number of voxels: {data.size}")

    # Call gs.vario_estimate passing the mask.
    bin_centers, gamma = gs.vario_estimate(
        pos=pos_all,
        field=field,
        mask=mask_flat,
        bin_edges=bins,
        mesh_type="unstructured",
        sampling_size=sampling_size,
        sampling_seed=sampling_seed,
    )
    return bin_centers, gamma
```

And a possible solution from the user point-of-view, which would be to apply the mask yourself to the position and field arrays. In this approach, we compute the valid (i.e. unmasked) indices and then pass only those points. This ensures that the number of points used in the sampling step (pnt_cnt) exactly matches the number of unmasked points.

```python
def estimate_variogram_isotropic_3d(
    data: np.ndarray,
    bins: np.ndarray,
    mask: np.ndarray,
    sampling_size: int = 2000,
    sampling_seed: int = 19920516,
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Estimate the isotropic variogram from 3D data using gstools.vario_estimate.
    Instead of relying on the mask argument (which causes a mismatch in point counts),
    this function pre-selects the unmasked (valid) points and passes only those.
    
    Parameters
    ----------
    data : np.ndarray
        3D volume data.
    bins : np.ndarray
        1D array of bin edges for the variogram.
    mask : np.ndarray
        3D boolean exclusion mask (True indicates voxels to exclude).
    sampling_size : int, optional
        Number of voxel pairs to sample.
    sampling_seed : int, optional
        Seed for random sampling.
    
    Returns
    -------
    bin_centers : np.ndarray
        Bin centers.
    gamma : np.ndarray
        Estimated variogram values.
    """
    # Generate the full grid of coordinates.
    x = np.arange(data.shape[0])
    y = np.arange(data.shape[1])
    z = np.arange(data.shape[2])
    pos_x, pos_y, pos_z = np.meshgrid(x, y, z, indexing="ij")
    pos_all = np.vstack((pos_x.flatten(), pos_y.flatten(), pos_z.flatten()))
    
    # Create the full field and mask.
    full_field = data.flatten()
    full_mask = mask.flatten()
    
    # Select only the unmasked (valid) indices.
    valid_idx = np.flatnonzero(~full_mask)
    pos_valid = pos_all[:, valid_idx]
    field_valid = np.atleast_2d(full_field[valid_idx])
    
    # Now call gstools without a mask (or with np.ma.nomask) because we've pre-filtered.
    bin_centers, gamma = gs.vario_estimate(
        pos=pos_valid,
        field=field_valid,
        mask=np.ma.nomask,
        bin_edges=bins,
        mesh_type="unstructured",
        sampling_size=sampling_size,
        sampling_seed=sampling_seed,
    )
    return bin_centers, gamma
```

I hope this helps, and i also hope I'm doing something wrong. Thank you for reading.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Possible unexpected behaviour of gs.vario_estimate with masked fields #377

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Possible unexpected behaviour of gs.vario_estimate with masked fields #377

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions