Using Multi-modal Models

For MTLLM to have real neurosymbolic power, it needs to handle multimodal inputs and outputs; that is, it should be able to understand text, images, and videos. This section describes how MTLLM handles multimodal inputs.

Images are supported in the default distribution of MTLLM, but videos are not. To use videos, install mtllm with the video extra:

pip install mtllm[video]
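
Note that some shells (notably zsh) treat square brackets as glob patterns, so you may need to quote the package specifier:

pip install "mtllm[video]"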

Image

MTLLM can accept images as inputs. You can pass an image to an MTLLM function or method using the Image type from mtllm, as in the following example:

import from mtllm { Model, Image }

glob llm = Model(model_name="gpt-4o");

'Personality of the Person'
enum Personality {
   INTROVERT = "Introvert",
   EXTROVERT = "Extrovert"
}

sem Personality.INTROVERT = 'Person who is shy and reticent';
sem Personality.EXTROVERT = 'Person who is outgoing and socially confident';

obj Person {
    has full_name: str,
        yod: int,  # year of death
        personality: Personality;
}

def get_person_info(img: Image) -> Person by llm();

with entry {
    image = Image("person.png");
    person_obj = get_person_info(image);
    print(person_obj);
}

Input Image: person.png

Output

Person(full_name='Albert Einstein', yod=1955, personality=Personality.INTROVERT)

In the example above, we pass an image of a person (Albert Einstein) to the get_person_info function, which returns information about the person in the image. The result is a Person object containing the person's full name, year of death, and personality.
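
Since images are just typed parameters, they can be freely mixed with ordinary text arguments in the same signature. The following is a minimal sketch built on the same API as above; the image_qa function, its question parameter, and the semantic string are illustrative names, not part of mtllm:

import from mtllm { Model, Image }

glob llm = Model(model_name="gpt-4o");

'Answer a question about the given image'
def image_qa(img: Image, question: str) -> str by llm();

with entry {
    image = Image("person.png");
    print(image_qa(image, "What is this person best known for?"));
}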

Video

Similarly, MTLLM can accept videos as inputs. You can pass a video to an MTLLM function or method using the Video type from mtllm. The fps argument controls how many frames per second are sampled from the video and sent to the model. Here is an example:

import from mtllm { Model, Video }

glob llm = Model(model_name="gpt-4o");

def explain_the_video(video: Video) -> str by llm();

with entry {
    video_file_path = "SampleVideo_1280x720_2mb.mp4";
    target_fps = 1;
    video = Video(path=video_file_path, fps=target_fps);
    print(explain_the_video(video));
}

Input Video: SampleVideo_1280x720_2mb.mp4

Output

The video features a large rabbit emerging from a burrow in a lush, green environment. The rabbit stretches and yawns, seemingly enjoying the morning. The scene is set in a vibrant, natural setting with bright skies and trees, creating a peaceful and cheerful atmosphere.
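
Because by llm() functions can return any typed value, video understanding composes with structured outputs as well. Here is a hedged sketch combining the Video input above with an object return type; the VideoSummary object and its fields are illustrative assumptions, not part of mtllm:

import from mtllm { Model, Video }

glob llm = Model(model_name="gpt-4o");

'Structured summary of a video clip'
obj VideoSummary {
    has title: str,           # illustrative: a short title for the clip
        description: str,     # illustrative: a one-paragraph summary
        key_objects: list[str];  # illustrative: notable objects that appear
}

def summarize_video(video: Video) -> VideoSummary by llm();

with entry {
    video = Video(path="SampleVideo_1280x720_2mb.mp4", fps=1);  # sample 1 frame per second
    print(summarize_video(video));
}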