Remote sensing scene classification plays a critical role in a wide range of real-world applications. Technically, however, scene classification is an extremely challenging task due to the huge complexity in remotely sensed scenes, and the difficulty in acquiring labelled data for model training such as supervised deep learning. To tackle these issues, a novel semi-supervised ensemble framework is proposed here using the self-training hierarchical prototype-based classifier as the base learner for chunk-by-chunk prediction. The framework has the ability to build a powerful ensemble model from both labelled and unlabelled images with minimum supervision. Different feature descriptors are employed in the proposed ensemble framework to offer multiple independent views of images. Thus, the diversity of base learners is guaranteed for ensemble classification. To further increase the overall accuracy, a novel cross-checking strategy was introduced to enable the base learners to exchange pseudo-labelling information during the self-training process, and maximize the correctness of pseudo-labels assigned to unlabelled images. Extensive numerical experiments on popular benchmark remote sensing scenes demonstrated the effectiveness of the proposed ensemble framework, especially where the number of labelled images available is limited. For example, the classification accuracy achieved on the OPTIMAL-31, PatternNet and RSI-CB256 datasets was up to 99.91%, 98. 67% and 99.07% with only 40% of the image sets used as labelled training images, surpassing or at least on par with mainstream benchmark approaches trained with double the number of labelled images.