Recent studies have introduced attention models for medical visual question answering (MVQA). In the medical domain, modeling “question attention” is as important as modeling “visual attention”. To support bidirectional reasoning between medical images and questions, this study proposes a new MVQA architecture named MCAN. The architecture incorporates a cross-modal co-attention network, FCAF, which identifies key words in questions and principal regions in images. A meta-learning channel attention module (MLCA) adaptively assigns a weight to each word and region, reflecting the model’s focus on specific words and regions during reasoning. In addition, this study designs and develops a medical domain-specific word embedding model, Med-GloVe, to further improve the model’s accuracy and practical value. Experimental results show that MCAN improves accuracy by 7.7% on free-form questions in the Path-VQA dataset and by 4.4% on closed-form questions in the VQA-RAD dataset, demonstrating its effectiveness for medical visual question answering.
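The idea of attending to words and image regions jointly can be illustrated with a minimal sketch. This is not the MCAN, FCAF, or MLCA implementation, whose details the abstract does not give; the function name `co_attention`, the dot-product affinity, the max-pooling step, and the toy feature vectors are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def co_attention(word_feats, region_feats):
    """Return (word_weights, region_weights) from one shared affinity matrix.

    Hypothetical simplification: affinity is a plain dot product, and each
    word (or region) is scored by its strongest match in the other modality.
    """
    # affinity[i][j] is the similarity between word i and region j.
    affinity = [[dot(w, r) for r in region_feats] for w in word_feats]
    # Question attention: score each word by its best-matching region.
    word_scores = [max(row) for row in affinity]
    # Visual attention: score each region by its best-matching word.
    region_scores = [
        max(affinity[i][j] for i in range(len(word_feats)))
        for j in range(len(region_feats))
    ]
    return softmax(word_scores), softmax(region_scores)


# Toy 2-dimensional features (assumed values, not taken from any dataset).
words = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
regions = [[2.0, 0.0], [0.0, 0.5]]
word_w, region_w = co_attention(words, regions)
```

In this toy case the first word and the first region share the strongest affinity, so both receive the largest weights. A learned model would replace the dot product and the pooling step with trained parameters.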
To address the difficulty of preserving anatomical structures, the low realism of generated images, and the loss of high-frequency image information in medical image cross-modal translation, this paper proposes a medical image cross-modal translation method based on diffusion generative adversarial networks. First, an unsupervised translation module converts magnetic resonance imaging (MRI) images into pseudo-computed tomography (CT) images. Next, a nonlinear frequency decomposition module extracts high-frequency CT images. Finally, the pseudo-CT image is fed into the forward process, while the high-frequency CT image serves as a conditional input that guides the reverse process to generate the final CT image. The proposed model is evaluated on the SynthRAD2023 dataset, which is used for CT image generation for radiotherapy planning. The generated brain CT images achieve a Fréchet Inception Distance (FID) of 33.1597, a structural similarity index measure (SSIM) of 89.84%, a peak signal-to-noise ratio (PSNR) of 35.5965 dB, and a mean squared error (MSE) of 17.8739. The generated pelvic CT images achieve an FID of 33.9516, an SSIM of 91.30%, a PSNR of 34.8707 dB, and an MSE of 17.4658. Experimental results show that the proposed model generates highly realistic CT images while preserving anatomical accuracy as much as possible. The translated CT images can be used effectively in radiotherapy planning, further enhancing diagnostic efficiency.
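For reference, two of the reported metrics, MSE and PSNR, follow standard definitions that can be sketched in a few lines. The abstract does not state the intensity range or normalization used in the evaluation, so the `data_range` parameter and the toy pixel values below are assumptions, and the function names are illustrative rather than taken from the paper.

```python
import math


def mse(reference, generated):
    """Mean squared error between two equal-length pixel sequences."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(r - g) ** 2 for r, g in zip(reference, generated)]
    return sum(squared) / len(squared)


def psnr(reference, generated, data_range):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE).

    data_range is the maximum possible intensity span; its value here is
    an assumption, since the abstract does not specify it.
    """
    err = mse(reference, generated)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / err)


# Toy flattened "images" (assumed values for illustration only).
real_ct = [0.0, 100.0, 200.0, 255.0]
fake_ct = [10.0, 90.0, 210.0, 245.0]
toy_mse = mse(real_ct, fake_ct)  # every pixel is off by 10, so MSE is 100.0
toy_psnr = psnr(real_ct, fake_ct, data_range=255.0)
```

Because PSNR depends on the chosen `data_range`, values are comparable across methods only when the same intensity scaling is used.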