`eval_toolkit.eda`#

EDA-first dataset integrity gating: thin, composable, torch-free per-split profiling and dataset-soundness diagnostics. This is a Tier-2 surface per ADR 0003: import names explicitly (`from eval_toolkit.eda import audit_dataset`); nothing here is in the package-root `__all__`, and the layer is intentionally evolvable within v1.x.

Three jobs, one module each (plus obfuscation profiling):

Job-1 integrity gate (`data_audit` + `obfuscation`): row counts, class balance, text-length quantiles, dedup / cross-split leakage, obfuscation prevalence.
Job-2 lexical shortcut diagnostics (`lexical_association`): weighted log-odds + PMI and partial-input / competency baselines, using NumPy + scikit-learn only.
Job-3 distribution shift (`distribution_shift`): proxy-A-distance, MMD (permutation-tested), and kNN purity over feature matrices. Embed text first with `eval_toolkit.embeddings.make_minilm_embedder` (the optional `[embeddings]` extra) or any vectorizer.

Job-1: data audit#

`audit_dataset`	Run the Job-1 dataset integrity gate.
`class_balance`	Summarize the binary class distribution of one slice.
`length_quantiles`	Compute char / word (and optionally token) length quantiles.
`summarize_split`	Build a `SplitSummary` for one slice.
`DataAudit`	Aggregate Job-1 integrity-gate report for a dataset.
`SplitSummary`	Per-split shape + class-balance + length profile.
`Tokenizer`
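
The checks named in the Job-1 gate (class balance, length quantiles, cross-split leakage) can be illustrated with a standard-library sketch. This is not the `eval_toolkit.eda` implementation, and it does not call it: the function names, the whitespace/case normalization used for duplicate detection, and the 0/1 label encoding are all assumptions made for the example.

```python
import hashlib
import math
from collections import Counter
from statistics import quantiles


def text_key(text: str) -> str:
    """Hash of a case-folded, whitespace-collapsed copy of the text (assumed normalization)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cross_split_overlap(train_texts, test_texts) -> float:
    """Fraction of test rows whose normalized text also occurs in the train split."""
    if not test_texts:
        return 0.0
    train_keys = {text_key(t) for t in train_texts}
    leaked = sum(1 for t in test_texts if text_key(t) in train_keys)
    return leaked / len(test_texts)


def neg_pos_ratio(labels) -> float:
    """Negatives per positive, assuming labels are encoded as 0 (negative) and 1 (positive)."""
    counts = Counter(labels)
    if counts[1] == 0:
        return math.inf
    return counts[0] / counts[1]


def char_length_quartiles(texts):
    """Three cut points splitting character lengths into quartiles (needs >= 2 texts)."""
    return quantiles([len(t) for t in texts], n=4)
```

A gate built on helpers like these would typically compare `neg_pos_ratio` against configured bounds and fail the audit when `cross_split_overlap` is non-zero; which thresholds the toolkit actually applies is not stated on this page.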

Job-1: obfuscation profiling#

`analyze_obfuscation`	Compute corpus-level obfuscation prevalence across `texts`.
`count_invisible_chars`	Count invisible (zero-width / variation-selector / tag) chars.
`has_high_entropy_alnum_run`	Detect a base64-shaped or hex-shaped high-entropy run in `text`.
`has_rot13_marker`	Return `True` if any ROT13-encoded PI marker appears in `text`.
`is_leeted_token`	Heuristic: does `token` look like leetspeak?
`leetspeak_counts`	Count leetspeak tokens and total alnum-shaped tokens in `text`.
`nfkc_changed`	Return `True` iff NFKC normalization changes `text` in any way.
`nfkc_char_delta`	Absolute char-count delta between `text` and its NFKC normalization.
`shannon_entropy`	Shannon entropy of `s` measured in bits/character.
`ObfuscationProfile`	Corpus-level obfuscation prevalence statistics.
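
Several of the per-text signals above are short enough to sketch with the standard library alone. The following is a minimal illustration of character-level Shannon entropy in bits, the two NFKC checks, and invisible-character counting. The function names are hypothetical, and the exact code-point ranges treated as invisible are an assumption; the toolkit's own set may differ.

```python
import math
import unicodedata
from collections import Counter

# Assumed set of zero-width code points; not taken from the toolkit.
ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}


def shannon_entropy_bits(s: str) -> float:
    """Entropy of the empirical character distribution of s, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def nfkc_changes(text: str) -> bool:
    """True if NFKC normalization alters the text at all."""
    return unicodedata.normalize("NFKC", text) != text


def nfkc_length_delta(text: str) -> int:
    """Absolute difference in code-point count before and after NFKC."""
    return abs(len(text) - len(unicodedata.normalize("NFKC", text)))


def count_invisible(text: str) -> int:
    """Count zero-width, variation-selector, and tag characters (assumed ranges)."""
    def is_invisible(cp: int) -> bool:
        return (
            cp in ZERO_WIDTH
            or 0xFE00 <= cp <= 0xFE0F        # variation selectors
            or 0xE0100 <= cp <= 0xE01EF      # variation selectors supplement
            or 0xE0000 <= cp <= 0xE007F      # tag characters
        )
    return sum(1 for ch in text if is_invisible(ord(ch)))
```

Note that a text can change under NFKC without changing length (a fullwidth letter maps to one ASCII letter), which is why a boolean check and a length delta are both useful.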

Job-2: lexical association#

`class_lexical_association`	Log-odds + PMI of the positive class vs the negative class (C1).
`competency_baselines`	Fit partial-input / competency baselines and score the shortcut floor (C2).
`default_tokenizer`	Lowercase and split `text` into word/number tokens.
`weighted_log_odds`	Weighted log-odds-ratio with an informative Dirichlet prior (Monroe 2008).
`BaselineScore`	One competency baseline's held-out score (C2).
`CompetencyResult`	Competency-baseline shortcut floor over one train→test split (C2).
`LexicalAssociationResult`	Per-token class-association statistics (C1).
`StrTokenizer`
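
For orientation, the weighted log-odds statistic with an informative Dirichlet prior (Monroe et al. 2008) can be sketched in pure Python. This is an illustration of the published formula, not the toolkit's code: the function name is hypothetical, and the choice to set each token's prior pseudo-count to `prior_scale` times its pooled count is an assumption about how the prior might be built.

```python
import math
from collections import Counter


def log_odds_z_scores(pos_tokens, neg_tokens, prior_scale=1.0):
    """Z-scored log-odds-ratio of each token, positive class vs negative class.

    Assumes the pooled vocabulary has at least two distinct tokens; with a
    single token the odds denominators below would be zero.
    """
    pos, neg = Counter(pos_tokens), Counter(neg_tokens)
    background = pos + neg
    # Informative prior: pseudo-counts proportional to pooled frequency (assumed form).
    alpha = {w: prior_scale * c for w, c in background.items()}
    alpha_0 = sum(alpha.values())
    n_pos, n_neg = sum(pos.values()), sum(neg.values())

    scores = {}
    for w, a_w in alpha.items():
        odds_pos = (pos[w] + a_w) / (n_pos + alpha_0 - pos[w] - a_w)
        odds_neg = (neg[w] + a_w) / (n_neg + alpha_0 - neg[w] - a_w)
        delta = math.log(odds_pos) - math.log(odds_neg)
        # Approximate variance of the log-odds difference.
        variance = 1.0 / (pos[w] + a_w) + 1.0 / (neg[w] + a_w)
        scores[w] = delta / math.sqrt(variance)
    return scores
```

Positive scores mark tokens over-represented in the positive class. Dividing by the estimated standard deviation keeps rare tokens from dominating the ranking, which is the main reason to prefer this statistic over raw log-odds when hunting lexical shortcuts.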

Job-3: distribution shift#

`distribution_shift`	Run PAD + MMD + kNN-purity on one pair of feature populations.
`knn_purity`	Mean fraction of each point's `k` nearest neighbours sharing its domain.
`maximum_mean_discrepancy`	Unbiased RBF-kernel MMD² with a permutation-test p-value (Gretton 2012).
`median_bandwidth`	Median-heuristic RBF bandwidth σ = median pairwise Euclidean distance.
`proxy_a_distance`	Proxy-A-distance via a CV'd linear domain classifier (Ben-David 2010).
`DistributionShiftResult`	Combined PAD + MMD + kNN-purity shift report for one pair of populations.
`KnnPurityResult`	kNN domain purity between two feature populations.
`MmdResult`	Maximum Mean Discrepancy between two feature populations.
`PadResult`	Proxy-A-distance between two feature populations.
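
The MMD pieces listed above fit together as follows: pick an RBF bandwidth with the median heuristic, compute the unbiased MMD² estimate, then compare it against values obtained by shuffling the domain labels. The sketch below uses only the standard library on lists of tuples so it stays self-contained. The function names, the choice to take the median over the pooled sample, and the add-one p-value correction are assumptions for illustration, not the toolkit's documented behaviour.

```python
import math
import random
from statistics import median


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def median_bandwidth(points):
    """Median pairwise Euclidean distance over all distinct pairs."""
    dists = [
        math.sqrt(sq_dist(points[i], points[j]))
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]
    return median(dists)


def mmd2_unbiased(xs, ys, sigma):
    """Unbiased MMD^2 estimate with kernel exp(-||a - b||^2 / (2 sigma^2))."""
    def k(a, b):
        return math.exp(-sq_dist(a, b) / (2.0 * sigma ** 2))

    m, n = len(xs), len(ys)
    # Within-sample sums skip the diagonal, which is what makes the estimate unbiased.
    k_xx = sum(k(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    k_yy = sum(k(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    k_xy = sum(k(a, b) for a in xs for b in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


def mmd_permutation_test(xs, ys, n_permutations=200, seed=0):
    """Return (observed MMD^2, permutation p-value). Requires non-identical points."""
    rng = random.Random(seed)
    pooled = list(xs) + list(ys)
    sigma = median_bandwidth(pooled)  # bandwidth fixed once, on the pooled sample
    observed = mmd2_unbiased(xs, ys, sigma)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        permuted = mmd2_unbiased(pooled[: len(xs)], pooled[len(xs):], sigma)
        if permuted >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_permutations + 1)
```

This version is quadratic in sample size for every permutation, so it only suits small samples; a practical implementation would compute the kernel matrix once and re-index it per permutation.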

Constants#

`BASE64_ENTROPY_THRESHOLD`	`float` constant.
`DEFAULT_CHAR_NGRAM_RANGE`	`tuple` constant.
`DEFAULT_KNN_K`	`int` constant.
`DEFAULT_MAX_NEG_POS_RATIO`	`float` constant.
`DEFAULT_MIN_COUNT`	`int` constant.
`DEFAULT_MIN_NEG_POS_RATIO`	`float` constant.
`DEFAULT_MMD_PERMUTATIONS`	`int` constant.
`DEFAULT_PAD_C`	`float` constant.
`DEFAULT_PAD_FOLDS`	`int` constant.
`DEFAULT_PCT_OVER_CONTEXT_THRESHOLD`	`float` constant.
`DEFAULT_PRIOR_SCALE`	`float` constant.
`EDA_AUDIT_SCHEMA_VERSION`	`str` constant.
`HEX_ENTROPY_THRESHOLD`	`float` constant.
