
Optimize CPU featurization pipeline: 37% speedup (#283)

Open

longleo17 wants to merge 1 commit into bytedance:main from longleo17:data-pipeline-speedup

Conversation

@longleo17
Contributor

This PR accelerates the data pipeline, a CPU bottleneck in current workflows.

Summary

  • Featurizer encoder: Replace manual one-hot dictionary construction with torch.nn.functional.one_hot, eliminating per-call dict comprehension overhead
  • Atom name encoding: Vectorize ref_atom_name_chars_encoded using numpy.frombuffer on ASCII bytes + F.one_hot, replacing nested Python loops over characters
  • Template featurizer: Pre-allocate output numpy arrays instead of list append + np.stack; reuse a shared DistogramFeaturesConfig instance across templates
  • Template parser/utils: Batch numpy operations for coordinate extraction, atom mask computation, and position lookups
  • Dataset filtering: Replace df.apply(lambda x: x in set) with vectorized df.isin() for eval_type filtering
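The two encoding changes above can be sketched as follows. This is a minimal illustration of the pattern, not the repository's actual code: the function names, the 4-character atom-name width, and the 64-symbol ASCII window starting at 32 are assumptions for the example.

```python
import numpy as np
import torch
import torch.nn.functional as F

def one_hot_residues(indices: list[int], num_classes: int) -> torch.Tensor:
    """One vectorized call replaces a per-call dict comprehension
    that built one-hot rows entry by entry."""
    return F.one_hot(torch.tensor(indices), num_classes=num_classes)

def encode_atom_name_chars(atom_name: str, width: int = 4) -> torch.Tensor:
    """Vectorized ASCII encoding of an atom name.

    Pad to a fixed width, view the bytes as uint8 via np.frombuffer,
    shift into a 64-symbol window (assumed ASCII 32..95), and one-hot
    the whole name at once instead of looping over characters.
    """
    padded = atom_name.ljust(width)[:width]
    codes = np.frombuffer(padded.encode("ascii"), dtype=np.uint8).astype(np.int64)
    return F.one_hot(torch.from_numpy(codes - 32), num_classes=64)
```

Both functions produce the same tensors as their loop-based equivalents; the win is that the per-element work moves from Python into a single torch/numpy kernel.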

Benchmark results

Measured end-to-end on representative PDB complexes with template+MSA workloads:

  • ~37% wall-clock reduction in CPU featurization time
  • No changes to model outputs (numerically identical features)
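The template-featurizer pre-allocation change follows a standard pattern, sketched below under assumed shapes (the real feature layout and function names differ): size the output array once, then write each template's features into its row, rather than appending to a Python list and paying for `np.stack` at the end.

```python
import numpy as np

def featurize_templates(templates: list[np.ndarray], feat_dim: int) -> np.ndarray:
    """Write each template's features into a pre-sized buffer.

    Avoids the list-append + np.stack pattern, which allocates every
    intermediate row and then copies them all again during the stack.
    """
    out = np.zeros((len(templates), feat_dim), dtype=np.float32)
    for i, feats in enumerate(templates):
        out[i] = feats  # single copy into the pre-allocated buffer
    return out
```

Reusing one shared `DistogramFeaturesConfig` across templates (rather than constructing one per template) is the same idea applied to config objects: hoist allocation out of the per-template loop.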

Files changed

| File | Change |
| --- | --- |
| protenix/data/core/featurizer.py | Vectorized one-hot encoding + atom name encoding |
| protenix/data/template/template_featurizer.py | Pre-allocated arrays, shared config |
| protenix/data/template/template_parser.py | Vectorized coordinate extraction |
| protenix/data/template/template_utils.py | Batch numpy operations |
| protenix/data/pipeline/dataset.py | Vectorized pandas filtering |
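The dataset filtering change can be sketched as below. The `eval_type` column name comes from the PR description; the sample data is made up for illustration. `df.apply(lambda x: x in eval_set)` evaluates the lambda row by row in Python, while `Series.isin` performs the membership test in one vectorized pass.

```python
import pandas as pd

# Hypothetical sample data; only the eval_type column name is from the PR.
df = pd.DataFrame({"eval_type": ["train", "valid", "test", "valid"]})
keep = {"valid", "test"}

mask = df["eval_type"].isin(keep)  # vectorized membership test
filtered = df[mask]
```

The result is identical to the `apply`-based filter, so this change (like the others in the PR) alters only speed, not output.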
