Ovis1.6-Gemma2-9B: A Breakthrough in Multimodal AI Technology by AIDC-AI

By Horay AI Team|Oct 31, 2024

AIDC-AI is the AI team at Alibaba International Digital Commerce Group. Released by AIDC-AI in mid-September 2024, Ovis1.6-Gemma2-9B is a groundbreaking Multimodal Large Language Model (MLLM) that has already made waves in the AI community. Because it is open-source, the model has drawn attention for its versatility and advanced capabilities. Ovis1.6-Gemma2-9B scored exceptionally well on OpenCompass, a respected benchmark for LLMs. It ranks ahead of a number of mainstream open-source models such as MiniCPM-V-2.6, Qwen2VL-7B and InternVL2-26B, and places first among open-source models with fewer than 30 billion parameters. As a result, Ovis1.6-Gemma2-9B stands out as a robust AI tool for a wide range of industries and applications.

Main Functions of Ovis1.6-Gemma2-9B

Mathematical Reasoning
Ovis1.6-Gemma2-9B's enhanced reasoning abilities allow it to solve complex mathematical problems precisely. Its training enables it to understand and carry out tasks that require logical, step-by-step problem-solving. In this area it scores even higher than GPT-4o-mini.
Object Recognition
The model also excels at identifying and classifying various objects with a high degree of accuracy. This is especially useful in scenarios requiring detailed visual analysis. For example, the model can easily distinguish between different species of flowers, such as a tulip and a lily.
Text Extraction
The model is equally strong at extracting relevant information from large volumes of text. Whether it’s parsing legal documents, summarizing research papers, or pulling data out of other documents, Ovis1.6-Gemma2-9B can quickly identify key pieces of information and present them in the structure the prompt asks for.
Image Understanding
Ovis1.6-Gemma2-9B demonstrates exceptional performance in image understanding, on par with state-of-the-art (SOTA) models. Its ability to analyze and interpret high-resolution images sets it apart, making it well suited to tasks that demand detailed image comprehension. It can recognize intricate patterns and fine details.
Complex Task Processing
Last but not least, one of Ovis1.6-Gemma2-9B’s standout features is its ability to handle a variety of prompt types while maintaining consistent performance on intricate inputs. For instance, it can seamlessly process text and image inputs at the same time. This combined analysis showcases Ovis1.6-Gemma2-9B's robust comprehension abilities and makes it highly effective in scenarios that require coordination across multiple formats.

Example Cases

1. Maths: When given a picture of a math problem, Ovis1.6-Gemma2-9B efficiently extracts the text from the image and provides a detailed, accurate solution within a short time. In addition, all the mathematical expressions are rendered clearly and coherently throughout the solution.

2. Food: The model easily identified the type of food in the picture and then provided detailed steps to prepare it, based on the prompt entered.

Advantages of Ovis1.6-Gemma2-9B

Innovative Architecture
Ovis1.6-Gemma2-9B is built on an architecture that combines visual and textual tokenizers, visual and textual embedding tables, and a large language model. Its key addition is a learnable visual embedding table that mirrors the embedding table used for text: continuous visual features are first converted into visual tokens, and structured visual embeddings are then generated by indexing the visual embedding table multiple times and weighting the results. This approach helps overcome a limitation of most MLLMs, whose visual and textual embeddings are structured differently, and in turn improves performance across a variety of tasks.
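The embedding-table idea described above can be illustrated with a small sketch. It is not AIDC-AI's code. It assumes that each visual token is a probability distribution over a visual vocabulary (obtained here with a softmax) and that the final embedding is the weighted sum of the embedding table's rows. All names and sizes (embed_patch, projection, VOCAB_SIZE, and so on) are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def vec_times_matrix(vec, matrix):
    """Multiply a length-n row vector by an n x m matrix (a list of rows)."""
    n_cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(n_cols)]


def embed_patch(patch_feature, projection, embedding_table):
    """Map one continuous patch feature to a visual token and its embedding."""
    # Score the patch against every entry of the (assumed) visual vocabulary.
    logits = vec_times_matrix(patch_feature, projection)
    # The visual token is a distribution over vocabulary entries.
    visual_token = softmax(logits)
    # The embedding is the table's rows weighted by those probabilities.
    embedding = vec_times_matrix(visual_token, embedding_table)
    return visual_token, embedding


# Toy sizes and random weights, for illustration only.
FEATURE_DIM, VOCAB_SIZE, EMBED_DIM = 4, 6, 3
rng = random.Random(0)
projection = [[rng.gauss(0, 1) for _ in range(VOCAB_SIZE)]
              for _ in range(FEATURE_DIM)]
embedding_table = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)]
                   for _ in range(VOCAB_SIZE)]
patch_feature = [rng.gauss(0, 1) for _ in range(FEATURE_DIM)]

visual_token, embedding = embed_patch(patch_feature, projection, embedding_table)
print("token weights sum to", round(sum(visual_token), 6))
print("embedding length:", len(embedding))
```

A textual token can be seen as the special case where the distribution is one-hot, so the weighted sum reduces to a plain table lookup. That is the sense in which the visual and textual embeddings share the same structure.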
High-Resolution Image Processing
The model is capable of processing images with extreme aspect ratios and high resolutions. This allows Ovis1.6-Gemma2-9B to achieve state-of-the-art (SOTA) performance in image understanding, making it suitable for handling detailed and complex visual tasks without losing quality.
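The article does not say how Ovis handles such images internally. One common technique in MLLMs is to split a large image into a grid of tiles whose layout follows the image's aspect ratio, and the sketch below shows that generic idea only. The function names and the max_tiles budget are hypothetical and are not taken from the Ovis code.

```python
import math


def choose_grid(width, height, max_tiles=9):
    """Pick (cols, rows) with cols * rows <= max_tiles that best matches the aspect ratio."""
    target = width / height
    best_key, best_grid = None, None
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles // rows + 1):
            # Compare ratios in log space so 2:1 and 1:2 are penalized equally.
            error = abs(math.log((cols / rows) / target))
            # On ties, prefer the grid with fewer tiles.
            key = (error, cols * rows)
            if best_key is None or key < best_key:
                best_key, best_grid = key, (cols, rows)
    return best_grid


def tile_boxes(width, height, cols, rows):
    """Return (left, top, right, bottom) crop boxes that cover the whole image."""
    xs = [i * width // cols for i in range(cols + 1)]
    ys = [j * height // rows for j in range(rows + 1)]
    return [(xs[c], ys[r], xs[c + 1], ys[r + 1])
            for r in range(rows) for c in range(cols)]


# A very wide panorama-style image.
cols, rows = choose_grid(4000, 500)
boxes = tile_boxes(4000, 500, cols, rows)
print(cols, rows, len(boxes))
```

Each tile can then be encoded at the vision encoder's native resolution, which is how tiling schemes of this kind avoid squashing a wide or tall image into a single square input.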
Comprehensive Data Optimization
The model has been trained on multiple multimodal datasets covering areas such as captioning, VQA (Visual Question Answering), OCR (Optical Character Recognition), and chart recognition. This broad training enhances Ovis1.6-Gemma2-9B’s performance in tasks like multimodal Q&A and following detailed instructions, making it versatile across a variety of text- and image-based queries.
Open-Source
In keeping with a philosophy of accessibility, Ovis1.6-Gemma2-9B is open-source under the Apache 2.0 license. Users can access the model’s weights, training data, and inference code, which makes it fully customizable and suitable for a wide range of commercial and non-commercial applications. The open-source nature of the Ovis series gives users the flexibility to adapt the model to their specific needs.

Application of Ovis in Cross-Border E-Commerce

Aidge (Alibaba International AI Platform) is Alibaba's platform for e-commerce, with technology support from AIDC-AI. International e-commerce faces many challenges: navigating complex overseas markets, high operational costs, competitive pressure, and more. AIGC technologies such as MLLMs can help here by providing effective solutions that reduce costs and improve efficiency.

For example, one major issue in overseas e-commerce is managing returns and refunds, which greatly affects the experience of both users and merchants. Refund and return audits were previously done manually, which required significant labor and time and sometimes led to inconsistent decisions due to subjective judgment.

Now, with Ovis, AIDC-AI has developed an intelligent refund system that leverages the company’s vast e-commerce knowledge. Ovis can process user-submitted images and videos related to refund claims and provide fast, consistent assessments. The result is a cost-effective, efficient solution that ensures fair treatment for both consumers and merchants.

In addition to Ovis, AIDC-AI has developed other advanced tools, such as the multi-language model Marco and the e-commerce-focused MLLM MarcoVL, which offer further MaaS (Model-as-a-Service) capabilities, including:

Multi-language text generation, allowing AI to optimize and localize product descriptions, breaking language and cultural barriers.
AI-driven image processing, enabling features like virtual fitting with a single click.
Intelligent image enhancement, including automatic background removal.

In these ways, AI is fundamentally transforming how merchants operate and how customers purchase, greatly boosting productivity and reducing costs. For platforms like Aidge, these AI-driven capabilities have become a key competitive advantage.

Conclusion

As AI continues to evolve, MLLMs like Ovis1.6-Gemma2-9B are likely to become more integrated into our daily lives, providing seamless assistance across industries. There is a wide range of application scenarios for MLLMs, including automated driving, medical diagnosis, video content understanding, image description generation, and visual Q&A. This breadth of applications gives Ovis1.6-Gemma2-9B a bright future. From mathematical reasoning to complex task processing, its range of capabilities makes it an invaluable asset in various industries.

Whether you’re a content creator, business leader, or developer, staying up to date on what this model can do will help you harness its full potential. Keep an eye on Ovis1.6-Gemma2-9B as it continues to push the boundaries of what AI can achieve.