eval_toolkit.eda

EDA-first dataset integrity gating: thin, composable, torch-free per-split profiling and dataset-soundness diagnostics. This is a Tier-2 surface per ADR 0003, so import it explicitly (from eval_toolkit.eda import audit_dataset). Nothing here is in the package-root __all__, and the layer is intentionally evolvable within v1.x.

Three jobs, one module each (plus obfuscation profiling):

  • Job-1 integrity gate (data_audit + obfuscation): row counts, class balance, text-length quantiles, dedup / cross-split leakage, obfuscation prevalence.

  • Job-2 lexical shortcut diagnostics (lexical_association): weighted log-odds and PMI, plus partial-input / competency baselines. Requires only NumPy and scikit-learn.

  • Job-3 distribution shift (distribution_shift): proxy-A-distance, MMD (permutation-tested), and kNN purity over feature matrices. Embed text first with eval_toolkit.embeddings.make_minilm_embedder (the optional [embeddings] extra) or any vectorizer.

Job-1: data audit

audit_dataset

Run the Job-1 dataset integrity gate.

class_balance

Summarize the binary class distribution of one slice.

length_quantiles

Compute char / word (and optionally token) length quantiles.

summarize_split

Build a SplitSummary for one slice.

DataAudit

Aggregate Job-1 integrity-gate report for a dataset.

SplitSummary

Per-split shape + class-balance + length profile.

Tokenizer
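The dedup / cross-split leakage check named for the Job-1 gate can be pictured with a small standalone sketch. It is not the toolkit's implementation: the normalization recipe (NFKC, case folding, whitespace collapsing) and the helper names normalize, fingerprint, and cross_split_leakage are assumptions made for illustration only.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC, fold case, and collapse whitespace (an assumed recipe)."""
    return " ".join(unicodedata.normalize("NFKC", text).casefold().split())


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text, used as a dedup key."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def cross_split_leakage(train: list[str], test: list[str]) -> float:
    """Fraction of test rows whose normalized text also occurs in train."""
    if not test:
        return 0.0
    seen = {fingerprint(t) for t in train}
    leaked = sum(fingerprint(t) in seen for t in test)
    return leaked / len(test)


train = ["Ignore previous instructions.", "What is the weather today?"]
test = ["ignore  previous   instructions.", "Tell me a joke."]
leak_rate = cross_split_leakage(train, test)  # one of the two test rows leaks
```

Exact-match hashing only catches duplicates that survive the chosen normalization; near-duplicate detection would need a fuzzier key.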

Job-1: obfuscation profiling

analyze_obfuscation

Compute corpus-level obfuscation prevalence across texts.

count_invisible_chars

Count invisible (zero-width / variation-selector / tag) chars.
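As a rough illustration of what counting invisible characters can involve, the sketch below checks each code point against three Unicode ranges. The exact set of code points that count_invisible_chars covers is not stated here, so the ranges and the names is_invisible and count_invisible_sketch are assumptions.

```python
ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}


def is_invisible(ch: str) -> bool:
    """True for zero-width, variation-selector, or tag code points (assumed set)."""
    cp = ord(ch)
    return (
        cp in ZERO_WIDTH
        or 0xFE00 <= cp <= 0xFE0F      # variation selectors VS1 to VS16
        or 0xE0100 <= cp <= 0xE01EF    # variation selectors supplement
        or 0xE0000 <= cp <= 0xE007F    # tag characters
    )


def count_invisible_sketch(text: str) -> int:
    """Number of characters in text that fall in the assumed invisible set."""
    return sum(is_invisible(ch) for ch in text)


hidden = count_invisible_sketch("pay\u200bload\ufe0f")  # one zero-width space, one VS16
```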

has_high_entropy_alnum_run

Detect a base64-shaped or hex-shaped high-entropy run in text.

has_rot13_marker

Return True if any ROT13-encoded PI marker appears in text.

is_leeted_token

Heuristic: does token look like leetspeak?

leetspeak_counts

Count leetspeak tokens and total alnum-shaped tokens in text.

nfkc_changed

Return True iff NFKC normalization changes text in any way.

nfkc_char_delta

Absolute char-count delta between text and its NFKC normalization.
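The two NFKC summaries above map naturally onto the standard library's unicodedata.normalize. The sketch below follows those one-line descriptions; the function names carry a _sketch suffix because the toolkit's own signatures are not shown on this page.

```python
import unicodedata


def nfkc_changed_sketch(text: str) -> bool:
    """True iff NFKC normalization alters the text in any way."""
    return unicodedata.normalize("NFKC", text) != text


def nfkc_char_delta_sketch(text: str) -> int:
    """Absolute difference in character count before and after NFKC."""
    return abs(len(text) - len(unicodedata.normalize("NFKC", text)))


ligature = "\ufb01le"            # U+FB01 (fi ligature) followed by "le"
fullwidth = "\uff21\uff22\uff23"  # fullwidth A, B, C
```

The two measures are complementary: the ligature expands to two letters, so both fire, while fullwidth letters map one-to-one and change the text without changing its length.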

shannon_entropy

Shannon entropy of s measured in bits/character.
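For reference, entropy in bits per character is H = -Σ p(c) log2 p(c) over the empirical character frequencies of the string. A minimal sketch follows; treating the empty string as zero entropy is an assumption, and entropy_bits_per_char is an illustrative name rather than the library function.

```python
import math
from collections import Counter


def entropy_bits_per_char(s: str) -> float:
    """Empirical Shannon entropy of the characters in s, in bits/character."""
    if not s:
        return 0.0  # assumed convention for empty input
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


low = entropy_bits_per_char("aaaa")   # a single symbol carries no information
high = entropy_bits_per_char("abcd")  # four equiprobable symbols
```

A uniform distribution over k distinct characters gives log2(k) bits, which bounds hex-shaped text at 4 bits/character and base64-shaped text at 6.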

ObfuscationProfile

Corpus-level obfuscation prevalence statistics.

Job-2: lexical association

class_lexical_association

Log-odds + PMI of the positive class vs the negative class (C1).

competency_baselines

Fit partial-input / competency baselines and score the shortcut floor (C2).

default_tokenizer

Lowercase and split text into word/number tokens.

weighted_log_odds

Weighted log-odds-ratio with an informative Dirichlet prior (Monroe 2008).
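The Monroe et al. (2008) statistic can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the toolkit's code: the prior for each token is taken to be prior_scale times its pooled relative frequency, and the function name, argument names, and the default of 10.0 are hypothetical.

```python
import math
from collections import Counter


def weighted_log_odds_sketch(pos_tokens, neg_tokens, prior_scale=10.0):
    """Per-token z-scores; positive values lean toward the positive class."""
    pos, neg = Counter(pos_tokens), Counter(neg_tokens)
    pooled = pos + neg
    if len(pooled) < 2:
        raise ValueError("need at least two distinct tokens")
    total = sum(pooled.values())
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    alpha0 = prior_scale  # the per-token priors sum to prior_scale
    scores = {}
    for tok, count in pooled.items():
        alpha = prior_scale * count / total  # informative Dirichlet prior (assumed form)
        lo_pos = math.log((pos[tok] + alpha) / (n_pos + alpha0 - pos[tok] - alpha))
        lo_neg = math.log((neg[tok] + alpha) / (n_neg + alpha0 - neg[tok] - alpha))
        variance = 1.0 / (pos[tok] + alpha) + 1.0 / (neg[tok] + alpha)
        scores[tok] = (lo_pos - lo_neg) / math.sqrt(variance)
    return scores


pos_tokens = "ignore previous instructions ignore all rules".split()
neg_tokens = "what is the weather please ignore typos".split()
z = weighted_log_odds_sketch(pos_tokens, neg_tokens)
```

Dividing the log-odds difference by its approximate standard deviation is what keeps rare tokens from dominating the ranking, which is the main reason to prefer this over raw log-odds.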

BaselineScore

One competency baseline's held-out score (C2).

CompetencyResult

Competency-baseline shortcut floor over one train→test split (C2).

LexicalAssociationResult

Per-token class-association statistics (C1).

StrTokenizer

Job-3: distribution shift

distribution_shift

Run PAD + MMD + kNN-purity on one pair of feature populations.

knn_purity

Mean fraction of each point's k nearest neighbours sharing its domain.
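A brute-force sketch of the purity measure described above: pool the two populations, find each point's k nearest neighbours (excluding itself), and average the share that come from the same domain. The name knn_purity_sketch, the default k=3, and the use of Euclidean distance are assumptions, and a real implementation would use an indexed neighbour search rather than this O(n²) loop.

```python
import math


def knn_purity_sketch(xs, ys, k=3):
    """Mean same-domain fraction among each point's k nearest neighbours."""
    points = [(p, 0) for p in xs] + [(p, 1) for p in ys]
    if not 0 < k < len(points):
        raise ValueError("k must be between 1 and the pooled size minus one")
    fractions = []
    for i, (p, domain) in enumerate(points):
        # Distances to every other pooled point, tagged with that point's domain.
        others = [(math.dist(p, q), d) for j, (q, d) in enumerate(points) if j != i]
        others.sort(key=lambda pair: pair[0])
        same = sum(d == domain for _, d in others[:k])
        fractions.append(same / k)
    return sum(fractions) / len(fractions)


separated = knn_purity_sketch(
    [(0, 0), (0, 1), (1, 0), (1, 1)],
    [(10, 10), (10, 11), (11, 10), (11, 11)],
    k=3,
)
interleaved = knn_purity_sketch([(0,), (2,), (4,), (6,)], [(1,), (3,), (5,), (7,)], k=1)
```

Purity near 1 indicates that the two populations occupy separate regions of feature space; for equally sized populations that are well mixed, it tends toward roughly 0.5.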

maximum_mean_discrepancy

Unbiased RBF-kernel MMD² with a permutation-test p-value (Gretton 2012).

median_bandwidth

Median-heuristic RBF bandwidth σ = median pairwise Euclidean distance.
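The MMD and bandwidth entries above fit together as sketched below with NumPy: an RBF kernel k(x, y) = exp(-‖x − y‖² / (2σ²)) with σ set by the median heuristic, the unbiased MMD² estimator, and a permutation test that relabels the pooled sample. Computing the median over the pooled sample, the default of 200 permutations, the add-one p-value correction, and all function names are assumptions for illustration.

```python
import numpy as np


def sq_dists(a, b):
    """Pairwise squared Euclidean distances, clipped at zero for stability."""
    d = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.maximum(d, 0.0)


def median_bandwidth_sketch(z):
    """Median pairwise Euclidean distance over distinct pairs of rows in z."""
    upper = np.triu_indices(len(z), k=1)
    return float(np.median(np.sqrt(sq_dists(z, z))[upper]))


def mmd2_unbiased(kernel, n):
    """Unbiased MMD^2 from a pooled kernel matrix whose first n rows are X."""
    m = kernel.shape[0] - n
    kxx, kyy, kxy = kernel[:n, :n], kernel[n:, n:], kernel[:n, n:]
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return float(term_x + term_y - 2.0 * kxy.mean())


def mmd_permutation_test(x, y, n_perm=200, seed=0):
    """Return (observed MMD^2, permutation p-value) for feature matrices x, y."""
    rng = np.random.default_rng(seed)
    z, n = np.vstack([x, y]), len(x)
    sigma = median_bandwidth_sketch(z)
    kernel = np.exp(-sq_dists(z, z) / (2.0 * sigma ** 2))
    observed = mmd2_unbiased(kernel, n)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(z))  # shuffle domain labels
        if mmd2_unbiased(kernel[np.ix_(perm, perm)], n) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, size=(40, 3))
y = rng.normal(3.0, 1.0, size=(40, 3))  # mean shifted by three standard deviations
mmd2, p_value = mmd_permutation_test(x, y)
```

Because the kernel matrix is computed once and only re-indexed per permutation, the cost of the test is dominated by the initial pairwise distances. The unbiased estimator can be slightly negative when the two populations match.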

proxy_a_distance

Proxy-A-distance via a CV'd linear domain classifier (Ben-David 2010).
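Proxy-A-distance is commonly computed as 2(1 − 2ε), where ε is the held-out error of a classifier trained to tell the two domains apart. The sketch below shows only that final conversion; the cross-validated linear classifier itself is omitted, and the function name and input validation are assumptions rather than the toolkit's behaviour.

```python
def proxy_a_distance_from_error(error: float) -> float:
    """Map a domain classifier's held-out error rate to a proxy-A-distance."""
    if not 0.0 <= error <= 1.0:
        raise ValueError("error must be a rate in [0, 1]")
    return 2.0 * (1.0 - 2.0 * error)


indistinguishable = proxy_a_distance_from_error(0.5)  # chance-level classifier
separable = proxy_a_distance_from_error(0.0)          # perfect classifier
```

Chance-level error maps to 0 and zero error to 2. An error above 0.5, which can happen by chance on small samples, yields a negative value unless the result is clipped.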

DistributionShiftResult

Combined PAD + MMD + kNN-purity shift report for one pair of populations.

KnnPurityResult

kNN domain purity between two feature populations.

MmdResult

Maximum Mean Discrepancy between two feature populations.

PadResult

Proxy-A-distance between two feature populations.

Constants

BASE64_ENTROPY_THRESHOLD

Float constant; entropy threshold for base64-shaped runs.

DEFAULT_CHAR_NGRAM_RANGE

Tuple constant; default character n-gram range.

DEFAULT_KNN_K

Integer constant; default k for kNN purity.

DEFAULT_MAX_NEG_POS_RATIO

Float constant; default maximum negative-to-positive ratio.

DEFAULT_MIN_COUNT

Integer constant; default minimum count.

DEFAULT_MIN_NEG_POS_RATIO

Float constant; default minimum negative-to-positive ratio.

DEFAULT_MMD_PERMUTATIONS

Integer constant; default number of MMD permutations.

DEFAULT_PAD_C

Float constant; default C for proxy-A-distance.

DEFAULT_PAD_FOLDS

Integer constant; default number of folds for proxy-A-distance.

DEFAULT_PCT_OVER_CONTEXT_THRESHOLD

Float constant; default percent-over-context threshold.

DEFAULT_PRIOR_SCALE

Float constant; default prior scale.

EDA_AUDIT_SCHEMA_VERSION

String constant; schema version of the EDA audit.

HEX_ENTROPY_THRESHOLD

Float constant; entropy threshold for hex-shaped runs.