Realizing multi-modal information recognition tasks which can process external information efficiently and comprehensively is an urgent requirement in the field of artificial intelligence. However, it ...