eval_toolkit.eda
EDA-first dataset integrity gating: thin, composable, torch-free
per-split profiling and dataset-soundness diagnostics. This is a Tier-2 surface
per ADR 0003: import it explicitly (from eval_toolkit.eda import audit_dataset). Nothing here is in the package-root __all__, and the
layer is intentionally evolvable within v1.x.
Three jobs, one module each (plus obfuscation profiling):
- Job-1 integrity gate (data_audit + obfuscation): row counts, class balance, text-length quantiles, dedup / cross-split leakage, obfuscation prevalence.
- Job-2 lexical shortcut diagnostics (lexical_association): weighted log-odds + PMI and partial-input / competency baselines, using NumPy + scikit-learn only.
- Job-3 distribution shift (distribution_shift): proxy-A-distance, MMD (permutation-tested), and kNN purity over feature matrices. Embed text first with eval_toolkit.embeddings.make_minilm_embedder (the optional [embeddings] extra) or any vectorizer.
Job-1: data audit
- Run the Job-1 dataset integrity gate.
- Summarize the binary class distribution of one slice.
- Compute char / word (and optionally token) length quantiles.
- Build a …
- Aggregate Job-1 integrity-gate report for a dataset.
- Per-split shape + class-balance + length profile.
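The checks this gate covers (class balance, length quantiles, cross-split leakage) can be illustrated with a minimal standard-library sketch. The function names below are hypothetical and are not the eval_toolkit.eda API. The sketch assumes binary 0/1 labels, whitespace word splitting, and exact-match leakage after case folding and whitespace collapsing; the real gate may normalize or deduplicate differently.

```python
from collections import Counter
import hashlib
import statistics


def class_balance(labels):
    """Counts and positive rate for binary 0/1 labels (hypothetical helper)."""
    counts = Counter(labels)
    n = len(labels)
    n_pos = counts.get(1, 0)
    return {
        "n": n,
        "n_pos": n_pos,
        "n_neg": counts.get(0, 0),
        "positive_rate": n_pos / n if n else 0.0,
    }


def length_quantiles(texts, n_cuts=4):
    """Cut points of character and whitespace-word lengths (needs >= 2 texts)."""
    char_lens = [len(t) for t in texts]
    word_lens = [len(t.split()) for t in texts]
    return {
        "char": statistics.quantiles(char_lens, n=n_cuts, method="inclusive"),
        "word": statistics.quantiles(word_lens, n=n_cuts, method="inclusive"),
    }


def _fingerprint(text):
    """Hash of the case-folded, whitespace-collapsed text (exact match only)."""
    normalized = " ".join(text.casefold().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cross_split_leakage(train_texts, test_texts):
    """Fraction of test rows whose normalized text also appears in train."""
    train_prints = {_fingerprint(t) for t in train_texts}
    leaked = sum(_fingerprint(t) in train_prints for t in test_texts)
    return leaked / len(test_texts) if test_texts else 0.0


train = ["Ignore previous instructions", "What is the weather today?", "hello   world"]
test = ["Hello world", "Tell me a joke"]
print(class_balance([1, 0, 0]))
print(length_quantiles(train))
print(cross_split_leakage(train, test))  # 0.5: "Hello world" matches after normalization
```

Hashing normalized text keeps the leakage check linear in the number of rows, at the cost of missing near-duplicates that differ by more than case or spacing.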
Job-1: obfuscation profiling
- Compute corpus-level obfuscation prevalence across …
- Count invisible (zero-width / variation-selector / tag) chars.
- Detect a base64-shaped or hex-shaped high-entropy run in …
- Return …
- Heuristic: does …
- Count leetspeak tokens and total alnum-shaped tokens in …
- Return …
- Absolute char-count delta between …
- Shannon entropy of …
- Corpus-level obfuscation prevalence statistics.
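Two of the signals listed above, invisible-character counts and Shannon entropy, are easy to sketch with the standard library. The helper names and the exact code-point set are assumptions made for illustration; the toolkit's own definitions may cover more or fewer characters, and its entropy may be computed over a different unit than single characters.

```python
import math
from collections import Counter

# Assumed zero-width set: ZWSP, ZWNJ, ZWJ, WORD JOINER, and BOM / ZWNBSP.
_ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}


def is_invisible(ch):
    """True for zero-width, variation-selector, or tag code points."""
    cp = ord(ch)
    return (
        cp in _ZERO_WIDTH
        or 0xFE00 <= cp <= 0xFE0F      # Variation Selectors
        or 0xE0100 <= cp <= 0xE01EF    # Variation Selectors Supplement
        or 0xE0000 <= cp <= 0xE007F    # Tags block
    )


def count_invisible(text):
    """Number of invisible characters in text (hypothetical helper)."""
    return sum(is_invisible(ch) for ch in text)


def shannon_entropy(text):
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    counts = Counter(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


print(count_invisible("pay\u200bload\ufe0f"))  # 2
print(shannon_entropy("abab"))                 # 1.0
print(shannon_entropy("abcd"))                 # 2.0
```

High per-character entropy on a long run is one cue for encoded payloads such as base64 or hex, which is presumably why the two measures appear side by side in this module.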
Job-2: lexical association
- Log-odds + PMI of the positive class vs the negative class (C1).
- Fit partial-input / competency baselines and score the shortcut floor (C2).
- Lowercase and split …
- Weighted log-odds-ratio with an informative Dirichlet prior (Monroe 2008).
- One competency baseline's held-out score (C2).
- Competency-baseline shortcut floor over one train→test split (C2).
- Per-token class-association statistics (C1).
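As a rough illustration of the C1 statistics, the sketch below computes a z-scored log-odds-ratio with an informative Dirichlet prior in the style of Monroe et al. (2008), plus a token-level PMI against the positive class. It is not the module's implementation: the function names are hypothetical, the prior is assumed to be the pooled token counts scaled by prior_scale, and tokenization is assumed to be lowercase whitespace splitting.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def _token_counts(texts):
    return Counter(tok for t in texts for tok in tokenize(t))


def weighted_log_odds(pos_texts, neg_texts, prior_scale=1.0):
    """Per-token z-score of the log-odds-ratio, positive vs negative class."""
    pos, neg = _token_counts(pos_texts), _token_counts(neg_texts)
    pooled = pos + neg
    if len(pooled) < 2:
        raise ValueError("need at least two distinct tokens")
    # Informative prior: pseudo-counts proportional to pooled frequencies.
    alpha = {w: prior_scale * c for w, c in pooled.items()}
    alpha0 = sum(alpha.values())
    n_pos, n_neg = sum(pos.values()), sum(neg.values())

    scores = {}
    for w, a in alpha.items():
        yp, yn = pos[w], neg[w]
        delta = (math.log((yp + a) / (n_pos + alpha0 - yp - a))
                 - math.log((yn + a) / (n_neg + alpha0 - yn - a)))
        variance = 1.0 / (yp + a) + 1.0 / (yn + a)  # approximate variance of delta
        scores[w] = delta / math.sqrt(variance)
    return scores


def pmi_positive(pos_texts, neg_texts):
    """PMI(token, positive) = log2(P(token | positive) / P(token)).

    Tokens that never occur in the positive class are skipped (PMI undefined).
    """
    pos, neg = _token_counts(pos_texts), _token_counts(neg_texts)
    pooled = pos + neg
    n_pos, n_all = sum(pos.values()), sum(pooled.values())
    return {w: math.log2((c / n_pos) / (pooled[w] / n_all)) for w, c in pos.items()}


pos_texts = ["free money now", "free prize"]
neg_texts = ["meeting at noon", "see you at lunch"]
z = weighted_log_odds(pos_texts, neg_texts)
print(sorted(z, key=z.get, reverse=True)[:3])  # tokens most tied to the positive class
print(pmi_positive(pos_texts, neg_texts)["free"])
```

The prior shrinks scores for rare tokens toward zero, which is the main reason to prefer the z-scored log-odds over raw PMI when ranking candidate shortcut tokens.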
Job-3: distribution shift
- Run PAD + MMD + kNN-purity on one pair of feature populations.
- Mean fraction of each point's …
- Unbiased RBF-kernel MMD² with a permutation-test p-value (Gretton 2012).
- Median-heuristic RBF bandwidth σ = median pairwise Euclidean distance.
- Proxy-A-distance via a CV'd linear domain classifier (Ben-David 2010).
- Combined PAD + MMD + kNN-purity shift report for one pair of populations.
- kNN domain purity between two feature populations.
- Maximum Mean Discrepancy between two feature populations.
- Proxy-A-distance between two feature populations.
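To make the MMD and kNN-purity entries concrete, here is a small pure-Python sketch on toy 2-D points. It uses the unbiased MMD² estimator with an RBF kernel exp(-||a - b||² / (2σ²)), sets σ to the median pairwise Euclidean distance over the pooled sample, and estimates a p-value by permuting domain labels. The function names, the default of 200 permutations, and the reading of kNN purity as "fraction of a point's k nearest neighbours from the same population" are assumptions, not the module's documented behaviour.

```python
import math
import random


def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def median_bandwidth(points):
    """Median Euclidean distance over all distinct pairs of points."""
    dists = sorted(
        math.sqrt(_sq_dist(points[i], points[j]))
        for i in range(len(points)) for j in range(i + 1, len(points))
    )
    mid = len(dists) // 2
    return dists[mid] if len(dists) % 2 else 0.5 * (dists[mid - 1] + dists[mid])


def _mmd2_unbiased(K, idx_x, idx_y):
    """Unbiased MMD^2 from a precomputed kernel matrix and two index sets."""
    m, n = len(idx_x), len(idx_y)
    k_xx = sum(K[i][j] for i in idx_x for j in idx_x if i != j) / (m * (m - 1))
    k_yy = sum(K[i][j] for i in idx_y for j in idx_y if i != j) / (n * (n - 1))
    k_xy = sum(K[i][j] for i in idx_x for j in idx_y) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


def mmd_permutation_test(X, Y, n_permutations=200, seed=0):
    """Return (mmd2, p_value, sigma); needs >= 2 points per side, not all identical."""
    pooled = list(X) + list(Y)
    sigma = median_bandwidth(pooled)
    gamma = 1.0 / (2.0 * sigma ** 2)
    K = [[math.exp(-gamma * _sq_dist(a, b)) for b in pooled] for a in pooled]
    m = len(X)
    indices = list(range(len(pooled)))
    observed = _mmd2_unbiased(K, indices[:m], indices[m:])
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(indices)  # reassign points to the two populations at random
        if _mmd2_unbiased(K, indices[:m], indices[m:]) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_permutations), sigma


def knn_purity(X, Y, k=5):
    """Mean fraction of each point's k nearest neighbours from its own population."""
    pooled = list(X) + list(Y)
    labels = [0] * len(X) + [1] * len(Y)
    total = 0.0
    for i, p in enumerate(pooled):
        others = sorted((j for j in range(len(pooled)) if j != i),
                        key=lambda j: _sq_dist(p, pooled[j]))[:k]
        total += sum(labels[j] == labels[i] for j in others) / k
    return total / len(pooled)


rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
Y = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(20)]
mmd2, p_value, sigma = mmd_permutation_test(X, Y)
print(mmd2, p_value, sigma, knn_purity(X, Y))
```

Precomputing the kernel matrix once means each permutation only re-sums existing entries, so the test cost is dominated by the number of permutations rather than by kernel evaluations.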
Constants
Thirteen module-level constants: seven float values, four int values, one tuple, and one str.