pycsa.core.validation

Held-out validation utilities for the structured prior.

Two callables here. spatial_cv_score() is the workhorse: it runs k-fold spatial cross-validation for an arbitrary pycsa.core.priors.Prior at a fixed lmbda and returns the per-fold and mean held-out MSE. SpatialCVSelector in hyperparams.py calls it internally over a lmbda grid; users can also call it directly to validate any prior choice without going through a selector.

Patch geometry, made concrete. Phase 1’s plan flagged this as the most under-specified piece. The implementation here:

Takes per-row coordinates as a (n_points, 2) array in any coordinate system (local Cartesian, (lon, lat), whatever the caller’s cell uses). When coords is None we fall back to row-index (np.arange) ordering: the points are treated as already being in scan-line order and are split by index. That is the only setting where the fallback is correct; the caller is responsible for providing real coordinates when the data is on a Delaunay grid or any other non-scan-line layout.
Computes a 2D bounding box from the supplied coords, partitions it into a near-square r × c grid where r·c ≥ n_folds, and assigns each fold to one tile. Excess tiles are unused. Tiles are contiguous in coordinate space — this is what spatial cross-validation actually means; per-point random shuffling leaks long-wavelength modes across folds and would silently overstate held-out accuracy.
Surrounds each held-out tile with a buffer zone of width buffer_fraction · tile_side. Points inside the buffer are excluded from both the training set and the evaluation set for that fold.
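The three steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the module's actual _build_spatial_folds: the helper names grid_shape, tile_index and fold_masks are hypothetical, the buffer is assumed to be a band just outside the tile, and tile_side is assumed to be measured per axis.

```python
import math
import numpy as np

def grid_shape(n_folds):
    """Near-square (r, c) with r * c >= n_folds (hypothetical helper)."""
    r = max(1, math.isqrt(n_folds))
    c = -(-n_folds // r)  # ceiling division
    return r, c

def tile_index(coords, shape):
    """Label each point with the flat index of the bounding-box tile it falls in."""
    r, c = shape
    lo = coords.min(axis=0)
    extent = coords.max(axis=0) - lo
    if np.any(extent <= 0):
        raise ValueError("coords have zero extent along an axis")
    u = (coords - lo) / extent  # normalise to [0, 1]
    # Clip so points on the upper edge land in the last tile.
    col = np.minimum((u[:, 0] * c).astype(int), c - 1)
    row = np.minimum((u[:, 1] * r).astype(int), r - 1)
    return row * c + col

def fold_masks(coords, tile, shape, buffer_fraction=0.1):
    """Boolean (train, eval, buffer) masks for one held-out tile."""
    r, c = shape
    lo = coords.min(axis=0)
    side = (coords.max(axis=0) - lo) / np.array([c, r])  # tile side in x, y
    row, col = divmod(tile, c)
    t_lo = lo + side * np.array([col, row])
    t_hi = t_lo + side
    pad = buffer_fraction * side  # assumption: per-axis buffer width
    eval_mask = tile_index(coords, shape) == tile
    inside = np.all((coords >= t_lo - pad) & (coords <= t_hi + pad), axis=1)
    buffer_mask = inside & ~eval_mask
    train_mask = ~(inside | eval_mask)
    return train_mask, eval_mask, buffer_mask

# Demo on a regular 10 x 10 grid of points in the unit square.
xs = np.linspace(0.0, 1.0, 10)
X, Y = np.meshgrid(xs, xs)
coords = np.column_stack([X.ravel(), Y.ravel()])
shape = grid_shape(4)
train, held_out, buffer = fold_masks(coords, 0, shape, buffer_fraction=0.2)
```

Every point ends up in exactly one of the three masks, so a random per-point shuffle never enters the picture; only the choice of which tiles to hold out remains.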

Documented limitation: works for cells whose points roughly fill a 2D region (MERIT regional cells, ETOPO regional cells). For ICON Delaunay-triangle cells with sparse coverage near a cell vertex, the bounding-box partition may produce empty tiles; the function raises ValueError in that case so the failure is visible.

Functions

spatial_cv_score(prior, lmbda, ...[, ...])

K-fold spatial cross-validation for any Prior.

pycsa.core.validation.spatial_cv_score(prior: Prior, lmbda: float, design_matrix: ndarray, data: ndarray, *, coords: ndarray | None = None, n_folds: int = 5, buffer_fraction: float = 0.1, rng_seed: int | None = None) → dict

K-fold spatial cross-validation for any Prior.

Solves the regularized normal equations on each fold’s training rows, predicts the held-out rows, and returns the per-fold and mean reconstruction MSE.
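A minimal sketch of the per-fold computation, under stated assumptions: the Prior interface is not shown here, so ridge_penalty is a hypothetical stand-in that returns an identity penalty scaled by lmbda, and fold_mse is an illustrative name rather than a function in this module.

```python
import numpy as np

def ridge_penalty(gram, lmbda):
    """Hypothetical stand-in for a Prior: identity penalty scaled by lmbda."""
    return lmbda * np.eye(gram.shape[0])

def fold_mse(design_matrix, data, train, held_out, lmbda, penalty=ridge_penalty):
    """Fit on the training rows, predict the held-out rows, return their MSE."""
    M_tr = design_matrix[train]
    y_tr = data[train]
    gram = M_tr.T @ M_tr  # normal-equations matrix for this fold
    rhs = M_tr.T @ y_tr
    coeffs = np.linalg.solve(gram + penalty(gram, lmbda), rhs)
    resid = design_matrix[held_out] @ coeffs - data[held_out]
    return float(np.mean(resid ** 2))

# Noise-free synthetic problem: 40 points, 3 modes.
rng = np.random.default_rng(0)
M = rng.normal(size=(40, 3))
y = M @ np.array([1.0, -2.0, 0.5])
train = np.arange(40) < 30
held_out = ~train

mse_unregularized = fold_mse(M, y, train, held_out, lmbda=0.0)
mse_regularized = fold_mse(M, y, train, held_out, lmbda=10.0)
```

On noise-free data the unregularized fit reproduces the held-out rows to rounding error, while a nonzero lmbda biases the coefficients and raises the held-out MSE; comparing such values is the kind of comparison a sweep over a lmbda grid makes.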

Parameters:

prior – Any pycsa.core.priors.Prior. Called per-fold with the fold’s normal-equations matrix.
lmbda – Regularization scale passed to the prior.
design_matrix – Dense M matrix, shape (n_points, n_modes).
data – Target vector, shape (n_points,).
coords – Per-row 2D coordinates for spatial fold construction. If None, falls back to a strided index split — only appropriate when rows are already in scan-line order.
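n_folds – Number of spatial folds; each fold holds out one tile of the r × c grid. See module docstring.
buffer_fraction – Buffer width around each held-out tile, as a fraction of the tile side. See module docstring.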
rng_seed – Seed for the RNG that shuffles the tile assignment order so the chosen folds are spatially spread rather than packed in one corner. None leaves the order unseeded.
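One way the seeded tile selection could look, as a sketch only: pick_fold_tiles is a hypothetical name, and the use of numpy.random.default_rng is an assumption about the implementation.

```python
import numpy as np

def pick_fold_tiles(n_tiles, n_folds, rng_seed=None):
    """Choose which tiles of the grid become folds, in shuffled order."""
    rng = np.random.default_rng(rng_seed)  # None draws fresh OS entropy
    order = rng.permutation(n_tiles)
    return order[:n_folds]  # excess tiles stay unused

# A 2 x 3 grid has 6 tiles; 5 folds leave one tile unused.
tiles_a = pick_fold_tiles(6, 5, rng_seed=0)
tiles_b = pick_fold_tiles(6, 5, rng_seed=0)
```

With the same seed the two calls return the same tiles, which is what makes a cross-validation run repeatable.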

Returns:

dict with keys:

per_fold_mse: ndarray of length n_folds.
mean_heldout_mse: Mean of per_fold_mse.
fold_sizes: ndarray of shape (n_folds, 2), holding (n_train, n_eval) per fold.

Return type:

dict

Raises:

ValueError – If coords rows do not match design_matrix rows, or if _build_spatial_folds cannot tile the points (fewer than 2 * n_folds points, zero extent in an axis, or a fold tile ending up with no eval points / fewer than two train points).