Retrieving the keyframes most relevant to text from labeled small-bowel videos can locate pathological regions efficiently and accurately. However, training directly on raw video data is extremely slow, while learning visual representations solely from image-text datasets leads to computational inconsistency between modalities. To tackle this challenge, a small-bowel video keyframe retrieval framework based on multi-modal contrastive learning (KRCL) is proposed. The framework fully exploits the textual information in video category labels to learn video features closely aligned with the text, while modeling temporal information within a pretrained image-text model. It transfers knowledge learned by image-text multimodal models to the video domain, enabling interaction among medical videos, images, and text. Experimental results on the Hyper-Kvasir dataset for gastrointestinal disease detection and the Microsoft Research video-to-text (MSR-VTT) retrieval dataset demonstrate the effectiveness and robustness of KRCL, which achieves state-of-the-art performance on nearly all evaluation metrics.
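To make the video-text contrastive objective concrete, the sketch below shows a symmetric InfoNCE-style loss between video and label-text embeddings, with simple mean-pooling standing in for temporal modeling. This is a minimal illustration under stated assumptions, not KRCL's actual architecture: the function names, the mean-pooling step, and the temperature value are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def video_text_contrastive_loss(frame_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    frame_embs: (batch, num_frames, dim) per-frame features from an image
        encoder; temporal aggregation here is a plain mean, an assumption --
        KRCL's temporal module is more elaborate.
    text_embs: (batch, dim) features of the category-label text.
    """
    video_embs = l2_normalize(frame_embs.mean(axis=1))  # temporal mean-pool
    text_embs = l2_normalize(text_embs)
    logits = video_embs @ text_embs.T / temperature     # (batch, batch) sims
    labels = np.arange(len(logits))                     # matched pairs on diag

    def xent(l):
        # numerically stable softmax cross-entropy against the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the video-to-text and text-to-video directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

When the pooled video embedding of each clip matches its label embedding, the diagonal dominates the similarity matrix and the loss approaches zero; mismatched pairs drive it up, which is what pulls video features toward their label text during training.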