
First of all, I'd like to commend the authors on the excellent work presented in SSS!
I have a quick question about the model architecture, specifically the frozen Image Encoder and Feature Decoder shown in Figure 2 of the paper:
Is the frozen Image Encoder identical in structure to the fine-tuned Image Encoder?
Does the Feature Decoder follow the same architecture as the MedSAM-2 Decoder?
To summarize my question: Is the architecture of the frozen backbone (Image Encoder + Feature Decoder) the same as that of MedSAM-2? If not, could you kindly provide a brief description of its structure?
Thank you in advance for the clarification. I look forward to your response!