Our Large Language Model as a Service (LLMaaS) offering gives you access to cutting-edge language models, with inference running on SecNumCloud-qualified, HDS-certified (healthcare data hosting) infrastructure located in France, making it a sovereign service. Benefit from high performance and optimal security for your AI applications. Your data remains strictly confidential: it is neither exploited nor stored after processing.

Simple, transparent pricing
0.9 €
per million input tokens
4 €
per million output tokens
21 €
per million reasoning tokens
0.01 €
per minute of transcribed audio *
Computed on infrastructure based in France, SecNumCloud-qualified and HDS-certified.
Note on the "Reasoning" price: this rate applies specifically to models classified as "reasoners" or "hybrids" (models with the "Reasoning" capability activated), when reasoning is active and only on the tokens generated by that activity.
* each minute started is billed in full
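
For concreteness, here is a minimal sketch of how these rates combine for a single request; the helper functions are illustrative, not part of any official SDK:

```python
import math

# Published rates, in euros (€)
PRICE_INPUT = 0.9 / 1_000_000       # per input token
PRICE_OUTPUT = 4.0 / 1_000_000      # per output token
PRICE_REASONING = 21.0 / 1_000_000  # per reasoning token
PRICE_AUDIO_MINUTE = 0.01           # per started minute of transcribed audio

def request_cost(input_tokens: int, output_tokens: int, reasoning_tokens: int = 0) -> float:
    """Cost in € of a single text request (illustrative helper)."""
    return (input_tokens * PRICE_INPUT
            + output_tokens * PRICE_OUTPUT
            + reasoning_tokens * PRICE_REASONING)

def transcription_cost(audio_seconds: float) -> float:
    """Cost in € of an audio transcription; any started minute is billed in full."""
    return math.ceil(audio_seconds / 60) * PRICE_AUDIO_MINUTE

# Example: 10k input tokens, 2k output tokens, 1k reasoning tokens
print(f"{request_cost(10_000, 2_000, 1_000):.4f} €")  # 0.0380 €
# Example: a 2 min 30 s recording is billed as 3 minutes
print(f"{transcription_cost(150):.2f} €")             # 0.03 €
```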

Large models

Our large models offer state-of-the-art performance for the most demanding tasks. They are particularly well-suited to applications requiring a deep understanding of language, complex reasoning or the processing of long documents.
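
As a quick orientation before the individual model cards, the sketch below shows what a call to one of these models might look like, assuming an OpenAI-compatible chat-completions API; the base URL and API key are placeholders to adapt to your account, not values documented on this page:

```python
# Minimal sketch of a chat request, assuming an OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="https://llmaas.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                    # placeholder credential
)

response = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise this contract clause in two sentences: ..."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```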

140 tokens/second

gpt-oss:120b

OpenAI's state-of-the-art open-weight language model, offering solid performance with a flexible Apache 2.0 licence.
A Mixture-of-Experts (MoE) model with 120 billion parameters and around 5.1 billion active parameters. It offers a configurable reasoning effort and full access to the chain of thought.
31 tokens/second

llama3.3:70b

State-of-the-art multilingual model developed by Meta, designed to excel at natural dialogue, complex reasoning and nuanced understanding of instructions.
Combining remarkable efficiency with reduced computational resources, this model offers extensive multilingual capabilities covering 8 major languages (English, French, German, Spanish, Italian, Portuguese, Hindi and Thai). Its contextual window of 132,000 tokens enables in-depth analysis of complex documents and long conversations, while maintaining exceptional overall consistency. Optimised to minimise bias and problematic responses.
24 tokens/second

gemma3:27b

Google's revolutionary model offers an optimum balance between power and efficiency, with an exceptional performance/cost ratio for demanding professional applications.
With unrivalled hardware efficiency, this model incorporates native multimodal capabilities and excels in multilingual performance in over 140 languages. Its impressive contextual window of 120,000 tokens makes it the ideal choice for analysing very large documents, document research and any application requiring understanding of extended contexts. Its optimised architecture allows flexible deployment without compromising the quality of results.
84 tokens/second

qwen3-coder:30b

MoE model optimised for software engineering tasks with a very long context.
Advanced agentic capabilities for software engineering tasks, native support for a 250K token context, pre-trained on 7.5T tokens with a high code ratio, and optimised by reinforcement learning to improve code execution rates.
118 tokens/second

qwen3-2507:30b-a3b

Enhanced version of Qwen3-30B's non-thinking mode, with improved general capabilities, knowledge coverage and user alignment.
Significant improvements in following instructions, reasoning, reading comprehension, mathematics, coding and tool use. Native context of 250k tokens.
59 tokens/second

qwen3-next:80b

Qwen3-Next 80B model in FP8, optimised for long contexts and reasoning, served via vLLM (A100).
A3B-Instruct variant in FP8, configured with a context of up to 262k tokens, with support for function calling, guided decoding (xgrammar) and speculative decoding (qwen3_next_mtp). Deployed on 2×A100 with vLLM.

qwen3-vl:30b

State-of-the-art multimodal model (Qwen3-VL) offering exceptional visual understanding and accurate temporal reasoning.
This Vision-Language model incorporates major innovations (DeepStack, MRoPE) for detailed analysis of images and videos. It excels at complex OCR, object detection, graph analysis, and spatio-temporal reasoning. Its architecture enables native understanding of video content and accurate structured extraction (JSON).
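
As an illustration of the structured extraction mentioned above, the following sketch sends an image to qwen3-vl:30b and asks for JSON output; it assumes an OpenAI-compatible vision API, and the endpoint, key and file name are placeholders:

```python
# Sketch of a visual-analysis request with structured JSON output,
# assuming an OpenAI-compatible vision API; all identifiers below
# (endpoint, key, file name) are illustrative placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://llmaas.example.com/v1", api_key="YOUR_API_KEY")

with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3-vl:30b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the invoice number, date and total as JSON."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```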

qwen3-vl:32b

High-performance variant of Qwen3-VL, optimised for the most demanding vision tasks.
Offers the same advanced capabilities as the 30B (DeepStack, MRoPE) with increased modelling capacity. Particularly effective for tasks requiring high visual analysis accuracy and deep contextual understanding. Supports text-timestamp alignment for video.

olmo3:7b

Fully open reference model, offering total transparency (data, code, weights) and remarkable efficiency.
OLMo 3-7B is a dense model optimised for efficiency (requiring 2.5 times fewer resources than Llama 3.1 8B for comparable performance). It excels particularly in mathematics and programming. With its 65k token window, it is ideal for tasks requiring full auditability.

olmo3:32b

The first fully open reasoning model at this scale, rivalling the best proprietary models.
OLMo 3-32B uses advanced architecture (GQA) to offer exceptional reasoning capabilities. It excels on complex benchmarks (MATH, HumanEvalPlus) and is capable of exposing its thought process (Think variant). It is the preferred choice for critical tasks requiring high performance and total transparency.
26 tokens/second

qwen3-2507:235b

Massive MoE model with 235 billion parameters, with only 22 billion active, offering cutting-edge performance.
Ultra-sparse Mixture-of-Experts architecture with 512 experts. Combines the power of a very large model with the efficiency of a smaller model. Excels at mathematics, coding, and logical reasoning.

Specialised models

Our specialised models are optimised for specific tasks such as code generation, image analysis or structured data processing. They offer an excellent performance/cost ratio for targeted use cases.

embeddinggemma:300m

Google's state-of-the-art embedding model, optimised for its size, ideal for search and semantic retrieval tasks.
Built on Gemma 3, this model produces vector representations of text for classification, clustering and similarity search. Trained on over 100 languages, its small size makes it perfect for resource-constrained environments.
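
A minimal sketch of how such an embedding model might be called, assuming an OpenAI-compatible /v1/embeddings endpoint (the base URL and key are placeholders):

```python
# Sketch of an embeddings request, assuming an OpenAI-compatible
# /v1/embeddings endpoint; base URL and key are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://llmaas.example.com/v1", api_key="YOUR_API_KEY")

result = client.embeddings.create(
    model="embeddinggemma:300m",
    input=["contract termination clause", "clause de résiliation du contrat"],
)
vectors = [item.embedding for item in result.data]
print(len(vectors), len(vectors[0]))  # number of vectors, embedding dimension
```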
85 tokens/second

gpt-oss:20b

OpenAI's open-weight language model, optimised for efficiency and deployment on consumer hardware.
A Mixture-of-Experts (MoE) model with 21 billion parameters and 3.6 billion active parameters. It offers configurable reasoning effort and agent capabilities.
77 tokens/second

qwen3-2507-think:4b

Qwen3-4B model optimised for reasoning, with improved performance on logic, maths, science and code tasks, and extended context to 250K tokens.
This 'Thinking' version has an increased thought length, making it ideal for highly complex reasoning tasks. It also offers general improvements in following instructions, using tools and generating text.
69 tokens/second

qwen3-2507:4b

Updated version of Qwen3-4B's non-thinking mode, with significant improvements in overall capabilities, extended knowledge coverage and better alignment with user preferences.
Significant improvements in following instructions, logical reasoning, reading comprehension, mathematics, coding and tool use. Native context of 250k tokens.

rnj-1:8b

8B open-weight model specialising in coding, mathematics and science (STEM).
RNJ-1 is a dense model with 8.3B parameters trained on 8.4T tokens. It uses global attention and YaRN to provide a context of 32k tokens. It excels at code generation (83.5% HumanEval+) and mathematical reasoning, often outperforming much larger models.

qwen3-vl:2b

Ultra-compact multimodal Qwen3-VL model, bringing advanced vision capabilities to edge devices.
Despite its small size, this model incorporates Qwen3-VL technologies (MRoPE, DeepStack) to deliver impressive image and video analysis. Ideal for mobile or embedded applications requiring OCR, object detection or rapid visual understanding.

qwen3-vl:4b

Balanced Qwen3-VL multimodal model, offering robust vision performance with a small footprint.
Excellent compromise between performance and resources. Capable of analysing complex documents, graphics and videos with high accuracy. Supports structured extraction and visual reasoning.
50 tokens/second

devstral:24b

Devstral (24B FP8) is an agentic LLM specialising in software engineering, co-developed by Mistral AI and All Hands AI.
Deployed in FP8 on 2×L40S (ia03, ia04). Devstral excels at using tools to explore code bases, modify multiple files and drive engineering agents. Based on Mistral Small 3, it offers advanced reasoning and coding capabilities. Configured with Mistral-specific settings (tokenizer, parser).

devstral-small-2:24b

Second iteration of Devstral (Small 2), a cutting-edge agentic model for software engineering, deployed on Mac Studio with a massive context.
Optimised for exploring codebases, multi-file editing and tool use. Offers performance close to that of 100B+ models on code (SWE-bench Verified: 68%). Natively supports vision. Deployed with an extended 380k-token context to handle entire projects.
28 tokens/second

granite4-small-h:32b

IBM's MoE (Mixture-of-Experts) model, designed as a "workhorse" for everyday business tasks, with excellent efficiency for long contexts.
This hybrid model (Transformer + Mamba-2) with 32 billion parameters (9B active) is optimised for enterprise workflows such as multi-tool agents and customer support automation. Its innovative architecture reduces RAM usage by more than 70% for long contexts and multiple batches.
77 tokens/second

granite4-tiny-h:7b

IBM's ultra-efficient hybrid MoE model, designed for low latency, edge and local applications, and as a building block for agentic workflows.
This 7 billion parameter (1B active) model combines Transformer and Mamba-2 layers for maximum efficiency. It reduces RAM usage by over 70% for long contexts, making it ideal for resource-constrained devices and fast tasks such as function calling.
120 tokens/second

deepseek-ocr

DeepSeek's specialist OCR model, designed for high-precision text extraction with formatting preservation.
Two-stage OCR system (visual encoder + MoE 3B decoder) optimised for converting documents into structured Markdown (tables, formulas). Requires specific pre-processing (Logits Processor) for optimum performance.
24 tokens/second

medgemma:27b

MedGemma is one of Google's most powerful open models for understanding medical text and images, based on Gemma 3.
MedGemma is suited to tasks such as generating medical imaging reports or answering natural-language questions about medical images. It can be adapted for use cases requiring medical knowledge, such as patient interviewing, triage, clinical decision support and summarisation. Although its baseline performance is solid, MedGemma is not yet clinical-grade and will likely require further fine-tuning. Based on the Gemma 3 architecture (natively multimodal), this 27B model incorporates a SigLIP image encoder pre-trained on medical data. It supports a 128k-token context and is served in FP16 for maximum precision.
56 tokens/second

mistral-small3.2:24b

Minor update to Mistral Small 3.1, improving instruction following, making function calling more robust and reducing repetition errors.
This version 3.2 retains the strengths of its predecessor while making targeted improvements. It is better able to follow precise instructions, produces fewer infinite generations or repetitive responses, and its function calling template is more robust. In other respects, its performance is equivalent to or slightly better than version 3.1.
88 tokens/second

granite3.2-vision:2b

IBM's revolutionary compact computer vision model, capable of directly analysing and understanding visual documents without the need for intermediate OCR technologies.
This compact model achieves the remarkable feat of matching the performance of much larger models across a wide range of visual comprehension tasks. Its ability to directly interpret the visual content of documents - text, tables, graphs and diagrams - without going through a traditional OCR stage represents a significant advance in terms of efficiency and accuracy. This integrated approach significantly reduces recognition errors and provides a more contextual and nuanced understanding of visual content.
29 tokens/second

magistral:24b

Mistral AI's first reasoning model: transparent, multilingual and excelling at domain-specific reasoning.
Ideal for general use requiring longer thought processing and greater accuracy. Useful for legal research, financial forecasting, software development and creative storytelling. Solves multi-step challenges where transparency and accuracy are essential.
37 tokens/second

cogito:32b

Advanced version of the Cogito model, offering considerably enhanced reasoning and analysis capabilities, designed for the most demanding applications in terms of analytical artificial intelligence.
This extended version of the Cogito model takes reasoning and comprehension capabilities even further, offering unrivalled depth of analysis for the most complex applications. Its sophisticated architectural design enables it to tackle multi-step reasoning with rigour and precision, while maintaining remarkable overall consistency. Ideal for mission-critical applications requiring artificial intelligence capable of nuanced reasoning and deep contextual understanding comparable to the analyses of human experts in specialist fields.

granite-embedding:278m

IBM's ultra-light embedding model for semantic search and classification.
Designed to generate dense vector representations of text, this model is optimised for efficiency and performance in semantic similarity, clustering and classification tasks. Its small size makes it ideal for large-scale deployments.

qwen3-embedding:0.6b

Compact embedding model from the Qwen3 family, optimised for efficiency.
The smallest dense model in the Qwen3 family, ideal for fast semantic search.

qwen3-embedding:4b

High-performance embedding model from the Qwen3 family.
Offers greater semantic accuracy thanks to its increased size.

qwen3-embedding:8b

High-performance embedding model from the Qwen3 family.
The largest embedding model in the range, for critical tasks.

granite3-guardian:2b

IBM's compact model specialised in security and compliance, detecting risks and inappropriate content.
Lightweight version of the Guardian family, trained to identify and filter harmful content, bias and security risks in text interactions. Offers robust protection with a small computational footprint. Context limited to 8k tokens.

granite3-guardian:8b

IBM model specialising in security and compliance, offering advanced risk detection capabilities.
Mid-sized model in the Guardian family, providing more in-depth security analysis than version 2B. Ideal for applications requiring rigorous content monitoring and strict compliance.

functiongemma:270m

Specialised micro-model with 270 million parameters, optimised to transform natural language into structured function calls on the Edge.
Based on the Gemma 3 architecture, this model is an expert in function calling. It is designed to be fine-tuned for specific domains, where it can achieve remarkable accuracy (85%) with minimal memory footprint. Ideal as a smart router or local action controller.
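
To illustrate, the sketch below shows a function-calling request to functiongemma:270m, assuming an OpenAI-compatible tools API; the endpoint, key and the get_weather tool are purely illustrative:

```python
# Sketch of a function-calling request, assuming an OpenAI-compatible
# "tools" API; endpoint, key and the weather tool are illustrative.
from openai import OpenAI

client = OpenAI(base_url="https://llmaas.example.com/v1", api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, not a documented API
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="functiongemma:270m",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
)
# The model answers with a structured call rather than free text.
print(response.choices[0].message.tool_calls)
```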

ministral-3:3b

Mistral AI's cutting-edge compact model, designed for efficiency in local and edge deployments.
Ministral 3B is a dense model optimised for low-latency local inference. It offers excellent reasoning and comprehension capabilities for its size, while being extremely efficient in terms of memory and computation.

ministral-3:8b

Mid-sized model in the Ministral family, offering an optimal balance between performance and resources.
Ministral 8B offers enhanced reasoning and comprehension capabilities compared to version 3B, while remaining suitable for high-performance local deployments. It is natively multimodal.

ministral-3:14b

The most powerful member of the Ministral family, designed for complex tasks on local infrastructure.
Ministral 14B offers performance close to that of higher-end models in a compact format. It excels at reasoning, coding, and complex multilingual tasks, while being deployable locally.

Model comparison

This comparison table will help you choose the model best suited to your needs, based on various criteria such as context size, performance and specific use cases.

Comparative table of the characteristics and performance of the available AI models, grouped by category (large models and specialised models).

| Model | Publisher | Parameters | Context (tokens) | Speed (tokens/s) |
|---|---|---|---|---|
| Large models | | | | |
| gpt-oss:120b | OpenAI | 120B | 120000 | 140 |
| llama3.3:70b | Meta | 70B | 132000 | 31 |
| gemma3:27b | Google | 27B | 120000 | 24 |
| qwen3-coder:30b | Qwen Team | 30B | 250000 | 84 |
| qwen3-2507:30b-a3b | Qwen Team | 30B | 250000 | 118 |
| qwen3-next:80b | Qwen Team | 80B | 262144 | 59 |
| qwen3-vl:30b | Qwen Team | 30B | 250000 | |
| qwen3-vl:32b | Qwen Team | 32B | 250000 | |
| olmo3:7b | AllenAI | 7B | 65536 | |
| olmo3:32b | AllenAI | 32B | 65536 | |
| qwen3-2507:235b | Qwen Team | 235B (22B active) | 130000 | 26 |
| Specialised models | | | | |
| embeddinggemma:300m | Google | 300M | 2048 | N.C. |
| gpt-oss:20b | OpenAI | 20B | 120000 | 85 |
| qwen3-2507-think:4b | Qwen Team | 4B | 250000 | 77 |
| qwen3-2507:4b | Qwen Team | 4B | 250000 | 69 |
| rnj-1:8b | Essential AI | 8B | 32000 | N.C. |
| qwen3-vl:2b | Qwen Team | 2B | 250000 | |
| qwen3-vl:4b | Qwen Team | 4B | 250000 | |
| devstral:24b | Mistral AI & All Hands AI | 24B | 120000 | 50 |
| devstral-small-2:24b | Mistral AI & All Hands AI | 24B | 380000 | N.C. |
| granite4-small-h:32b | IBM | 32B (9B active) | 128000 | 28 |
| granite4-tiny-h:7b | IBM | 7B (1B active) | 128000 | 77 |
| deepseek-ocr | DeepSeek AI | 3B | 8192 | 120 |
| medgemma:27b | Google | 27B | 128000 | 24 |
| mistral-small3.2:24b | Mistral AI | 24B | 128000 | 56 |
| granite3.2-vision:2b | IBM | 2B | 16384 | 88 |
| magistral:24b | Mistral AI | 24B | 40000 | 29 |
| cogito:32b | Deep Cogito | 32B | 32000 | 37 |
| granite-embedding:278m | IBM | 278M | 512 | N.C. |
| qwen3-embedding:0.6b | Qwen Team | 0.6B | 8192 | N.C. |
| qwen3-embedding:4b | Qwen Team | 4B | 8192 | N.C. |
| qwen3-embedding:8b | Qwen Team | 8B | 8192 | N.C. |
| granite3-guardian:2b | IBM | 2B | 8192 | N.C. |
| granite3-guardian:8b | IBM | 8B | 32000 | N.C. |
| functiongemma:270m | Google | 270M | 32768 | N.C. |
| ministral-3:3b | Mistral AI | 3B | 250000 | N.C. |
| ministral-3:8b | Mistral AI | 8B | 250000 | N.C. |
| ministral-3:14b | Mistral AI | 14B | 250000 | N.C. |
Legend and explanation
N.C.: value not communicated.
* Quick: model capable of generating more than 50 tokens per second.
* Energy efficiency: indicates particularly low energy consumption (< 2.0 kWh/Mtoken).
Note on performance measures
The speed values (tokens/s) represent performance targets under real-world conditions. Energy consumption (kWh/Mtoken) is calculated by dividing the estimated power of the inference server (in watts) by the measured speed of the model (in tokens/second), which gives joules per token, then converting to kilowatt-hours per million tokens (a net division by 3.6, since 1 kWh = 3.6 MJ and a million tokens contributes a factor of 10⁶). This method offers a practical comparison of the energy efficiency of different models and should be used as a relative indicator rather than an absolute measure of power consumption.
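
As a worked example of this method (the 400 W server power below is an assumption for illustration, not a published figure):

```python
# Energy per million tokens, following the method described above.
# kWh/Mtoken = (power_watts / tokens_per_second) / 3.6
# Derivation: W / (tok/s) = J/token; × 1e6 = J/Mtoken; ÷ 3.6e6 = kWh/Mtoken.
def kwh_per_mtoken(power_watts: float, tokens_per_second: float) -> float:
    return power_watts / tokens_per_second / 3.6

# Illustrative only: a hypothetical 400 W server serving a model at 140 tok/s
print(round(kwh_per_mtoken(400, 140), 2))  # ~0.79 kWh/Mtoken, "energy efficient" (< 2.0)
```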

Recommended use cases

Here are some common use cases and the most suitable models for each. These recommendations are based on the specific performance and capabilities of each model.

Multilingual dialogue

Chatbots and assistants capable of communicating in several languages, with automatic detection, context maintenance throughout the conversation and understanding of linguistic specificities.
Recommended models
  • Llama 3.3
  • Mistral Small 3.2
  • Qwen 3
  • OpenAI gpt-oss
  • Granite 4

Analysis of long documents

Processing of large documents (>100 pages), maintaining context throughout the text, extracting key information, generating relevant summaries and answering specific content questions
Recommended models
  • Gemma 3
  • Qwen 3 Next
  • Qwen 3
  • Granite 4

Programming and development

Generating and optimising code in multiple languages, debugging, refactoring, developing complete functionalities, understanding complex algorithmic implementations and creating unit tests
Recommended models
  • DeepCoder
  • Qwen3 Coder
  • Granite 4
  • Devstral

Visual analysis

Direct processing of images and visual documents without OCR pre-processing, interpretation of technical diagrams, graphs, tables, drawings and photos with generation of detailed textual explanations of the visual content
Recommended models
  • DeepSeek-OCR
  • Mistral Small 3.2
  • Gemma 3
  • Qwen 3 VL

Safety and compliance

Applications requiring specific security capabilities: filtering of sensitive content, traceability of reasoning, GDPR/HDS verification, risk minimisation, vulnerability analysis and compliance with sectoral regulations
Recommended models
  • Granite Guardian
  • Granite 4
  • Devstral
  • Mistral Small 3.2
  • Magistral Small

Light and on-board deployments

Applications requiring a minimal resource footprint, deployment on capacity-constrained devices, real-time inference on standard CPUs and integration into embedded or IoT systems
Recommended models
  • Gemma 3n
  • Granite 4 tiny
  • Qwen 3 VL (2B)