What Is a Token? Comparing Multimodal Costs

Understanding AI pricing starts with understanding tokens. While text tokenization is well-documented, multimodal tokens (images and video) are calculated very differently by providers like Google and OpenAI.

The Text Baseline

In language models, one token corresponds to roughly 3/4 of an English word, so 100 tokens is about 75 words. When you use an API, you pay for both the prompt (input) and the generated response (output), and the two are often billed at different rates.
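As a rough illustration of the 3/4-word rule, the sketch below estimates token counts from a word count and turns them into a cost. The function names and the per-million-token prices are hypothetical placeholders, not any provider's real rates, and a real tokenizer will give different counts for code, numbers, or non-English text.

```python
import math


def estimate_text_tokens(word_count: int) -> int:
    """Estimate tokens from a word count, assuming 1 token is about 3/4 of a word."""
    return math.ceil(word_count * 4 / 3)


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,   # hypothetical price, in dollars
    output_price_per_million: float,  # hypothetical price, in dollars
) -> float:
    """Return the estimated cost of one request in dollars."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# A 750-word prompt is roughly 1000 tokens under this rule of thumb.
prompt_tokens = estimate_text_tokens(750)
# Placeholder prices: $2 per million input tokens, $8 per million output tokens.
cost = estimate_cost(prompt_tokens, 500, 2.0, 8.0)
```

Input and output are kept as separate parameters because they are billed separately, so a short prompt with a long response can cost more than the reverse.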

Multimodal Complexity: Images

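Unlike text, images are usually priced by splitting them into fixed-size "tiles" and charging a set number of tokens per tile.

OpenAI (GPT-4o) charges a base cost of 85 tokens, plus 170 tokens for every 512x512 pixel tile. The image is resized before it is tiled, and low-detail mode costs a flat 85 tokens.
Google (Gemini 1.5) charges a flat 258 tokens per image, while newer Gemini versions split larger images into 768x768 tiles at 258 tokens each.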
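The OpenAI tile formula can be sketched as follows. The 85-token base, 170 tokens per tile, and 512-pixel tile size come from the text above. The resize steps (fit within 2048x2048, then shrink the shortest side to 768 pixels, never upscaling) are an assumption based on OpenAI's published description of high-detail mode and may not match every model version. The function name is hypothetical.

```python
import math


def openai_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o image tokens under the assumed resize-then-tile rules."""
    base_tokens, tokens_per_tile, tile_size = 85, 170, 512

    if detail == "low":
        # Low-detail mode is a flat base cost regardless of image size.
        return base_tokens

    # Assumed step 1: shrink to fit inside a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Assumed step 2: shrink so the shortest side is at most 768 pixels.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Count the 512x512 tiles needed to cover the resized image.
    tiles = math.ceil(w / tile_size) * math.ceil(h / tile_size)
    return base_tokens + tokens_per_tile * tiles


# A 1024x1024 image is resized to 768x768, which needs 4 tiles.
tokens = openai_image_tokens(1024, 1024)  # 85 + 4 * 170 = 765
```

Because resizing happens before tiling in this sketch, the tile count depends on the resized dimensions and aspect ratio, not on the raw pixel count of the uploaded file.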

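This means larger images generally cost more to process than small ones, but cost does not grow in proportion to pixel count. Providers resize images before tiling, so under OpenAI's published resize rules a 4K image and a 1080p image with the same aspect ratio can end up with the same number of tiles and the same token count.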

Video Pricing

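Video adds frame rate, measured in Frames Per Second (FPS), to the calculation. To save costs, developers rarely send full 60 FPS video to an AI model. Instead, the video is sampled at a much lower rate: Gemini, for example, samples video at 1 FPS by default, and each sampled frame is counted in tokens much like a still image.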
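To see why the sampling rate matters, here is a minimal sketch that estimates the visual token count of a clip. It assumes each sampled frame costs the same flat 258 tokens quoted above for a Gemini image, and it ignores audio tokens and any provider-specific overhead. The function name and default values are illustrative, not part of any provider's API.

```python
import math


def video_frame_tokens(
    duration_seconds: float,
    sample_fps: float = 1.0,      # assumed sampling rate
    tokens_per_frame: int = 258,  # assumed flat per-frame cost
) -> int:
    """Estimate visual tokens for a video sampled at a fixed frame rate."""
    frames = math.ceil(duration_seconds * sample_fps)
    return frames * tokens_per_frame


# One minute of video sampled at 1 FPS: 60 frames.
sampled = video_frame_tokens(60)  # 60 * 258 = 15480

# The same minute at a full 60 FPS: 3600 frames.
full_rate = video_frame_tokens(60, sample_fps=60)  # 3600 * 258 = 928800
```

Under these assumptions, sending every frame of a 60 FPS clip would cost 60 times as many visual tokens as sampling it at 1 FPS.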

Use our Multimodal Calculator to instantly compare how different models price your specific media files locally and securely.