
Optimize CPU featurization pipeline: 37% speedup (#283)

Open

longleo17 wants to merge 1 commit into bytedance:main from longleo17:data-pipeline-speedup

Conversation

@longleo17
Contributor

This PR accelerates the data pipeline, a CPU bottleneck in current workflows.

Summary

  • Featurizer encoder: Replace manual one-hot dictionary construction with torch.nn.functional.one_hot, eliminating per-call dict comprehension overhead
  • Atom name encoding: Vectorize ref_atom_name_chars_encoded using numpy.frombuffer on ASCII bytes + F.one_hot, replacing nested Python loops over characters
  • Template featurizer: Pre-allocate output numpy arrays instead of list append + np.stack; reuse a shared DistogramFeaturesConfig instance across templates
  • Template parser/utils: Batch numpy operations for coordinate extraction, atom mask computation, and position lookups
  • Dataset filtering: Replace df.apply(lambda x: x in set) with vectorized df.isin() for eval_type filtering
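The two encoding changes above can be sketched as follows. This is a minimal illustration of the pattern, not the repository's actual code: the function names, the 4-character atom-name width, and the 64-symbol ASCII window starting at 32 are assumptions for the example.

```python
import numpy as np
import torch
import torch.nn.functional as F

def one_hot_residues(indices: list[int], num_classes: int) -> torch.Tensor:
    """One vectorized call replaces a per-call dict comprehension
    that built one-hot rows entry by entry."""
    return F.one_hot(torch.tensor(indices), num_classes=num_classes)

def encode_atom_name_chars(atom_name: str, width: int = 4) -> torch.Tensor:
    """Vectorized ASCII encoding of an atom name.

    Pad to a fixed width, view the bytes as uint8 via np.frombuffer,
    shift into a 64-symbol window (assumed ASCII 32..95), and one-hot
    the whole name at once instead of looping over characters.
    """
    padded = atom_name.ljust(width)[:width]
    codes = np.frombuffer(padded.encode("ascii"), dtype=np.uint8).astype(np.int64)
    return F.one_hot(torch.from_numpy(codes - 32), num_classes=64)
```

Both functions produce the same tensors as their loop-based equivalents; the win is that the per-element work moves from Python into a single torch/numpy kernel.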

Benchmark results

Measured end-to-end on representative PDB complexes with template+MSA workloads:

  • ~37% wall-clock reduction in CPU featurization time
  • No changes to model outputs (numerically identical features)
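The template-featurizer pre-allocation change follows a standard pattern, sketched below under assumed shapes (the real feature layout and function names differ): size the output array once, then write each template's features into its row, rather than appending to a Python list and paying for `np.stack` at the end.

```python
import numpy as np

def featurize_templates(templates: list[np.ndarray], feat_dim: int) -> np.ndarray:
    """Write each template's features into a pre-sized buffer.

    Avoids the list-append + np.stack pattern, which allocates every
    intermediate row and then copies them all again during the stack.
    """
    out = np.zeros((len(templates), feat_dim), dtype=np.float32)
    for i, feats in enumerate(templates):
        out[i] = feats  # single copy into the pre-allocated buffer
    return out
```

Reusing one shared `DistogramFeaturesConfig` across templates (rather than constructing one per template) is the same idea applied to config objects: hoist allocation out of the per-template loop.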

Files changed

| File | Change |
| --- | --- |
| protenix/data/core/featurizer.py | Vectorized one-hot encoding + atom name encoding |
| protenix/data/template/template_featurizer.py | Pre-allocated arrays, shared config |
| protenix/data/template/template_parser.py | Vectorized coordinate extraction |
| protenix/data/template/template_utils.py | Batch numpy operations |
| protenix/data/pipeline/dataset.py | Vectorized pandas filtering |
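The dataset filtering change can be sketched as below. The `eval_type` column name comes from the PR description; the sample data is made up for illustration. `df.apply(lambda x: x in eval_set)` evaluates the lambda row by row in Python, while `Series.isin` performs the membership test in one vectorized pass.

```python
import pandas as pd

# Hypothetical sample data; only the eval_type column name is from the PR.
df = pd.DataFrame({"eval_type": ["train", "valid", "test", "valid"]})
keep = {"valid", "test"}

mask = df["eval_type"].isin(keep)  # vectorized membership test
filtered = df[mask]
```

The result is identical to the `apply`-based filter, so this change (like the others in the PR) alters only speed, not output.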
