# Using Multi-modal Models
byLLM supports multimodal inputs, including text, images, and videos. This section covers how to pass image and video inputs to byLLM functions and methods.

Images are supported in the default byLLM distribution. Video support requires installing byllm with the `video` extra:
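Assuming the extra is published under the name `video`, as the sentence above indicates, the install command would be:

```bash
pip install "byllm[video]"
```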
## Image
byLLM supports image inputs through the `Image` type. Images can be passed as arguments to byLLM functions or methods:
```jac
import from byllm.llm { Model, Image }

glob llm = Model(model_name="gpt-4o");

'Personality of the Person'
enum Personality {
    INTROVERT = "Introvert",
    EXTROVERT = "Extrovert"
}

sem Personality.INTROVERT = 'Person who is shy and reticent';
sem Personality.EXTROVERT = 'Person who is outgoing and socially confident';

obj Person {
    has full_name: str,
        yod: int,  # year of death
        personality: Personality;
}

def get_person_info(img: Image) -> Person by llm();

with entry {
    image = Image("person.png");
    person_obj = get_person_info(image);
    print(person_obj);
}
```
Input image: `person.png`

Output:

```
Person(full_name='Albert Einstein', yod=1955, personality=Personality.INTROVERT)
```
In this example, an image of a person is passed to the `get_person_info` function, which returns a `Person` object populated with the information extracted from the image.
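The same pattern works for methods: a method declared with `by llm()` inside an object can take an `Image` parameter as well. The following is a minimal sketch; `Profile` and `describe_photo` are hypothetical names used for illustration, not part of the byLLM API:

```jac
import from byllm.llm { Model, Image }

glob llm = Model(model_name="gpt-4o");

obj Profile {
    has name: str;

    'Describe the person in the photo in one sentence.'
    def describe_photo(img: Image) -> str by llm();
}

with entry {
    profile = Profile(name="Unknown");
    # Hypothetical image file; any local image path should work here.
    print(profile.describe_photo(Image("person.png")));
}
```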
## Video
byLLM supports video inputs through the `Video` type. Videos can be passed as arguments to byLLM functions or methods:
```jac
import from byllm { Model, Video }

glob llm = Model(model_name="gpt-4o");

def explain_the_video(video: Video) -> str by llm();

with entry {
    video_file_path = "SampleVideo_1280x720_2mb.mp4";
    target_fps = 1;  # frames per second sampled from the video
    video = Video(path=video_file_path, fps=target_fps);
    print(explain_the_video(video));
}
```
Input video: `SampleVideo_1280x720_2mb.mp4`

Output:

```
The video features a large rabbit emerging from a burrow in a lush, green environment. The rabbit stretches and yawns, seemingly enjoying the morning. The scene is set in a vibrant, natural setting with bright skies and trees, creating a peaceful and cheerful atmosphere.
```
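As a usage note, the `fps` argument appears to set the frame-sampling rate: at `fps=1`, roughly one frame per second of video is extracted and sent to the model. Higher values should capture more temporal detail at the cost of more tokens per request.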