The PyTorch implementation can be found in encoder_decoder.py.
Encoder
The encoder is an EfficientNet with weights pretrained on ImageNet.
The final layer of the EfficientNet is removed, and all prior layers are frozen for the duration of the training process.
The image embedding is passed through a linear layer to reduce the dimensionality of the feature vector to the dimensionality of the joint embedding space.
This final layer is jointly trained along with the decoder in order to learn the joint embedding space.
Decoder
The decoder is an LSTM which generates a caption for the image.
At the start of the decoding process, the feature vector from the encoder is passed through the LSTM so that the hidden state is conditioned on the embedded representation of the image.
A linear layer maps the hidden state outputs to the vocabulary space, from which a probability distribution over the next word in the caption is obtained.
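A minimal sketch of such a decoder is shown below, assuming teacher forcing during training and assuming the image feature is fed as the first step of the input sequence. The class name `DecoderSketch`, the word embedding layer, and all dimension arguments are hypothetical and may differ from encoder_decoder.py.

```python
import torch
import torch.nn as nn


class DecoderSketch(nn.Module):
    """Hypothetical decoder: LSTM over [image feature, caption words] + linear vocabulary head."""

    def __init__(self, embed_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        # Assumption: words are embedded into the same space as the image feature.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Maps each hidden state to one score per vocabulary entry.
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, embed_dim); captions: (batch, seq_len) of token ids.
        words = self.word_embed(captions)                                # (batch, seq_len, embed_dim)
        # The image feature is the first input, so the LSTM state sees it before any word.
        inputs = torch.cat([image_features.unsqueeze(1), words], dim=1)  # (batch, seq_len + 1, embed_dim)
        hidden, _ = self.lstm(inputs)                                    # (batch, seq_len + 1, hidden_dim)
        return self.to_vocab(hidden)                                     # unnormalized scores (logits)
```

The linear layer returns logits rather than probabilities. Applying `torch.softmax(logits, dim=-1)` gives the distribution over the next word at each step, while `nn.CrossEntropyLoss` takes the logits directly during training.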