SubspaceScout


import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sheshe import SubspaceScout, ModalBoundaryClustering

X, y = load_iris(return_X_y=True)
scout = SubspaceScout().fit(X, y)
pair = scout.results_[0]["features"]
ModalBoundaryClustering().fit(X[:, pair], y).plot_classes(X[:, pair], y)
plt.show()

Identifies informative feature subsets so that ModalBoundaryClustering can
run only where it matters in high-dimensional spaces. Uses mutual information or
optional model-based scorers to rank interactions, and employs beam search to
enumerate promising subspaces.

Mathematical formulation

Mutual information between a feature subset X and the target Y is I(X;Y) = ∑_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ).

The synergy objective for a set S is Synergy(S) = I(S;Y) - ∑_{i∈S} I(X_i;Y), capturing information beyond individual features.
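The synergy objective can be checked numerically on a toy XOR target, where each feature alone carries no information but the pair determines the label. The `mutual_info` and `synergy` helpers below are an illustrative sketch over already-discretised features, not SubspaceScout's internals:

```python
import numpy as np

def mutual_info(x, y):
    """Empirical I(X;Y) in nats for discrete arrays x (2-D) and y (1-D)."""
    _, x_idx = np.unique(x, axis=0, return_inverse=True)
    _, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    for i, j in zip(x_idx.ravel(), y_idx.ravel()):
        joint[i, j] += 1            # joint counts over (x, y) cells
    joint /= joint.sum()            # joint probabilities p(x, y)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

def synergy(X, y, S):
    """Synergy(S) = I(S;Y) - sum_{i in S} I(X_i;Y) for a feature subset S."""
    return mutual_info(X[:, S], y) - sum(mutual_info(X[:, [i]], y) for i in S)

# XOR-like target: each feature alone is uninformative, the pair is not.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 2))
y = X[:, 0] ^ X[:, 1]
print(synergy(X, y, [0, 1]))  # close to log(2) ≈ 0.69 nats
```

Here the joint term I(S;Y) captures nearly one bit while each individual term is near zero, which is exactly the "information beyond individual features" the objective rewards.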

Beam search grows subspaces order by order and retains the highest-scoring candidates at each order.
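As a sketch of that strategy, the toy `beam_search` below grows subsets one feature at a time, scores each candidate by joint mutual information, and keeps only the `beam_width` best per order. The helper names are hypothetical; the real search additionally applies synergy gains, occupancy, and support filters:

```python
import numpy as np

def entropy(a):
    """Empirical entropy in nats of the rows of a (discrete values)."""
    _, counts = np.unique(a, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def joint_mi(X, y, S):
    """I(X_S; Y) = H(X_S) + H(Y) - H(X_S, Y) for discrete features."""
    return entropy(X[:, S]) + entropy(y) - entropy(np.c_[X[:, S], y])

def beam_search(X, y, max_order=3, beam_width=4):
    n_feat = X.shape[1]
    # order 1: rank single features, keep the top beam_width
    beam = sorted(((joint_mi(X, y, [i]), (i,)) for i in range(n_feat)),
                  reverse=True)[:beam_width]
    results = list(beam)
    for _ in range(2, max_order + 1):
        # extend every retained parent by one feature not already in it
        cands = {}
        for _, S in beam:
            for j in range(n_feat):
                if j not in S:
                    T = tuple(sorted(S + (j,)))
                    cands.setdefault(T, joint_mi(X, y, list(T)))
        beam = sorted(((s, T) for T, s in cands.items()),
                      reverse=True)[:beam_width]
        results.extend(beam)
    return results  # (score, subset) per order, best first within each order

# XOR pair hidden among noise features: only (0, 1) predicts y.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 4))
y = X[:, 0] ^ X[:, 1]
pairs = [S for _, S in beam_search(X, y, max_order=2) if len(S) == 2]
print(pairs[0])  # the synergistic pair ranks first
```

Pruning to `beam_width` candidates per order is what keeps the search tractable: it avoids enumerating all C(d, k) subsets at each order k.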

Example


from sheshe import SubspaceScout
scout = SubspaceScout()
scout.fit(X, y)
subspaces = scout.results_

Usage examples


from sheshe import SubspaceScout

scout = SubspaceScout(random_state=0)
scout.fit(X, y)          # fit
results = scout.results_ # access discovered subspaces

Additional examples


from sheshe import SubspaceScout

scout = SubspaceScout(
    # model_method='lightgbm',  # default uses mutual information
    max_order=4,
    top_m=50,
    base_pairs_limit=12,
    beam_width=10,
    extend_candidate_pool=16,
    branch_per_parent=4,
    marginal_gain_min=1e-3,
    max_eval_per_order=150,
    sample_size=4096,
    time_budget_s=None,
    task='classification',
    random_state=0,
)
scout.fit(X, y)
subspaces = scout.results_

Parameters

  • model_method (None or "lightgbm" or "ebm", default None): model used to score subspaces. None uses mutual information.
  • max_order (int, default 3): maximum size of feature combinations to explore.
  • n_bins (int, default 8): number of bins for discretising features.
  • top_m (int, default 20): number of features pre-selected by individual mutual information.
  • branch_per_parent (int, default 5): maximum extensions generated per parent subspace.
  • density_occup_min (float, default 0.03): minimum occupancy ratio for a subspace to be valid.
  • min_support (int, default 30): minimum number of samples required in a subspace.
  • sample_size (int or None, default 4096): optional subsampling size to accelerate computations.
  • task ("classification" or "regression", default "classification"): learning task.
  • random_state (int, default 0): seed for reproducibility.
  • base_pairs_limit (int, default 12): maximum number of seed pairs for higher-order searches.
  • beam_width (int, default 12): number of candidates retained at each order during beam search.
  • extend_candidate_pool (int or None, default 16): random candidate features sampled per parent when order ≥3.
  • marginal_gain_min (float, default 1e-3): minimum synergy gain required to accept an extension.
  • max_eval_per_order (int or None, default 1000): cap on mutual information evaluations per order.
  • time_budget_s (float or None, default None): global time budget in seconds for fit.
  • objective ("mi_joint" or "mi_synergy", default "mi_joint"): score used to rank subspaces.
  • min_per_order (int, default 1): minimum number of subspaces to keep per order.

Methods

  • fit(X, y) – discover subspaces and store them in results_.