How do multimodal AI models work? A simple explanation