SubspaceScout
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sheshe import SubspaceScout, ModalBoundaryClustering
X, y = load_iris(return_X_y=True)
scout = SubspaceScout().fit(X, y)
pair = scout.results_[0]["features"]
ModalBoundaryClustering().fit(X[:, pair], y).plot_classes(X[:, pair], y)
plt.show()
Identifies informative feature subsets so that ModalBoundaryClustering can
run only where it matters in high-dimensional spaces. Uses mutual information or
optional model-based scorers to rank interactions and employs beam search to
enumerate promising subspaces.
Mathematical formulation
Mutual information between a subset X and target Y is I(X;Y)=∑_{x,y} p(x,y) log(p(x,y)/(p(x)p(y))).
The synergy objective for a set S is Synergy(S) = I(S;Y) - ∑_{i∈S} I(X_i;Y), capturing information beyond individual features.
Beam search grows subspaces order by order and retains those with highest scores.
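The formulation above can be sketched with plain scikit-learn. This is an illustrative toy, not SubspaceScout's actual implementation: it bins features (mirroring n_bins=8), estimates I(S;Y) from a joint bin code, computes Synergy(S), and grows subspaces from pairs to triples with a tiny beam search. The beam width of 3 here is arbitrary.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import mutual_info_score

X, y = load_iris(return_X_y=True)
n_bins = 8

# Discretise every feature into ordinal bin codes (mirrors n_bins=8).
B = KBinsDiscretizer(n_bins=n_bins, encode="ordinal",
                     strategy="quantile").fit_transform(X).astype(int)

def joint_mi(feats):
    """I(S;Y): mutual information between the joint bin code of S and y."""
    code = np.zeros(len(y), dtype=int)
    for f in feats:
        code = code * n_bins + B[:, f]  # one integer per bin combination
    return mutual_info_score(code, y)

def synergy(feats):
    """Synergy(S) = I(S;Y) - sum_i I(X_i;Y)."""
    return joint_mi(feats) - sum(joint_mi([f]) for f in feats)

beam_width = 3
# Order 2: score all feature pairs, keep the best few.
beam = sorted(combinations(range(X.shape[1]), 2),
              key=joint_mi, reverse=True)[:beam_width]
# Order 3: extend each retained pair by one new feature, keep the best few.
cands = {tuple(sorted(p + (f,)))
         for p in beam for f in range(X.shape[1]) if f not in p}
beam3 = sorted(cands, key=joint_mi, reverse=True)[:beam_width]
print("best pair:", beam[0], "synergy:", round(synergy(beam[0]), 3))
```

Because mutual information only grows as features are added, ranking by synergy (rather than joint MI) penalises subsets whose joint score is explained by their individual members, which is what the marginal_gain_min threshold below guards.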
Example
from sheshe import SubspaceScout
scout = SubspaceScout()
scout.fit(X, y)
subspaces = scout.results_
Usage examples
from sheshe import SubspaceScout
scout = SubspaceScout(random_state=0)
scout.fit(X, y) # fit
from sheshe import SubspaceScout
scout = SubspaceScout(random_state=0).fit(X, y)
results = scout.results_ # access discovered subspaces
Additional examples
from sheshe import SubspaceScout
scout = SubspaceScout(
# model_method='lightgbm', # default uses mutual information
max_order=4,
top_m=50,
base_pairs_limit=12,
beam_width=10,
extend_candidate_pool=16,
branch_per_parent=4,
marginal_gain_min=1e-3,
max_eval_per_order=150,
sample_size=4096,
time_budget_s=None,
task='classification',
random_state=0,
)
scout.fit(X, y)
subspaces = scout.results_
Parameters
- model_method (None or "lightgbm" or "ebm", default None): model used to score subspaces. None uses mutual information.
- max_order (int, default 3): maximum size of feature combinations to explore.
- n_bins (int, default 8): number of bins for discretising features.
- top_m (int, default 20): number of features pre-selected by individual mutual information.
- branch_per_parent (int, default 5): maximum extensions generated per parent subspace.
- density_occup_min (float, default 0.03): minimum occupancy ratio for a subspace to be valid.
- min_support (int, default 30): minimum number of samples required in a subspace.
- sample_size (int or None, default 4096): optional subsampling size to accelerate computations.
- task ("classification" or "regression", default "classification"): learning task.
- random_state (int, default 0): seed for reproducibility.
- base_pairs_limit (int, default 12): maximum number of seed pairs for higher-order searches.
- beam_width (int, default 12): number of candidates retained at each order during beam search.
- extend_candidate_pool (int or None, default 16): number of random candidate features sampled per parent when order ≥ 3.
- marginal_gain_min (float, default 1e-3): minimum synergy gain required to accept an extension.
- max_eval_per_order (int or None, default 1000): cap on mutual information evaluations per order.
- time_budget_s (float or None, default None): global time budget in seconds for fit.
- objective ("mi_joint" or "mi_synergy", default "mi_joint"): score used to rank subspaces.
- min_per_order (int, default 1): minimum number of subspaces to keep per order.
Methods
fit(X, y) – discover subspaces and store them in results_.