JTCNet: Joint Transformer and CNN Network for Remote Heart Rate Estimation
While many deep learning-based approaches have reported promising results on remote photoplethysmography (rPPG) based physiological signal sensing, they typically employ a pre-processing step that converts a 2D video sequence into a concatenated temporal signal map so that 2D convolutions can capture the periodic cues of the heart rate (HR) signal. Still, the capability of convolutional neural network (CNN) based methods to exploit long-range dynamic cues of HR signals remains limited. In this paper, we propose a new rPPG-based HR estimation method that overcomes these limitations by taking into account both local and long-range changes of the temporal HR signal via a Joint Transformer and Convolutional Network (JTCNet). In particular, JTCNet follows a two-stream architecture in which a revised U-Net branch, called BVPNet, captures local spatial-temporal changes of the temporal HR signals using newly designed spatial- and frequency-domain losses, while a Transformer branch captures long-range HR signal correlations, so that the learned features complement each other well. Experiments on the VIPL-HR and MMSE-HR datasets show that our method outperforms state-of-the-art methods that use either raw or pre-processed videos as input.
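The abstract does not include code, so the following is only a toy NumPy sketch of the general two-stream idea it describes (not the authors' actual JTCNet): a convolution-style branch summarizes local temporal changes of a 1D signal, a self-attention-style branch relates all time steps to capture long-range correlations, and their outputs are concatenated as complementary features. All function names and the single-head, Q = K = V attention simplification are illustrative assumptions.

```python
import numpy as np

def local_branch(x, kernel_size=3):
    # CNN-style branch (assumed form): a small 1-D convolution
    # captures local temporal changes of the signal.
    kernel = np.ones(kernel_size) / kernel_size
    return np.convolve(x, kernel, mode="same")

def attention_branch(x):
    # Transformer-style branch (assumed form): self-attention over all
    # time steps captures long-range correlations. For simplicity,
    # queries, keys, and values are all the raw signal itself.
    scores = np.outer(x, x) / np.sqrt(len(x))
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def two_stream(x):
    # Concatenate the complementary local and long-range features.
    return np.concatenate([local_branch(x), attention_branch(x)])

signal = np.sin(np.linspace(0, 8 * np.pi, 64))  # toy periodic "HR" signal
features = two_stream(signal)
print(features.shape)  # one local + one global feature per time step
```

In the real method, both branches operate on learned spatial-temporal representations and are trained end-to-end; this sketch only illustrates why the two feature types differ (local smoothing vs. all-pairs interaction).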