Our Large Language Model as a Service (LLMaaS) offering gives you access to state-of-the-art language models, served on a sovereign infrastructure hosted in France, SecNumCloud-qualified and HDS-certified for hosting health data. Benefit from high performance and optimum security for your AI applications. Your data remains strictly confidential and is neither used nor stored after processing.

Simple, transparent pricing
€0.90 per million input tokens
€4.00 per million output tokens
€21.00 per million reasoning tokens
Served on an infrastructure based in France, SecNumCloud-qualified and HDS-certified.
Note on the "Reasoning" price: this rate applies specifically to models classified as "reasoners" or "hybrids" (models with the "Reasoning" capability activated) when reasoning is active, and only to the tokens generated by that activity.
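As an illustration, the cost of a request can be estimated directly from its token counts. The sketch below applies the rates above; the token counts used are hypothetical.

```python
# Estimate the cost of a request under the published LLMaaS rates (EUR).
# Token counts below are hypothetical values for illustration.

RATE_INPUT = 0.9 / 1_000_000       # EUR per input token
RATE_OUTPUT = 4.0 / 1_000_000      # EUR per output token
RATE_REASONING = 21.0 / 1_000_000  # EUR per reasoning token (reasoner/hybrid models only)

def estimate_cost(input_tokens: int, output_tokens: int, reasoning_tokens: int = 0) -> float:
    """Return the estimated cost in euros for one request."""
    return (input_tokens * RATE_INPUT
            + output_tokens * RATE_OUTPUT
            + reasoning_tokens * RATE_REASONING)

# Example: 12,000 input tokens, 1,500 output tokens, 3,000 reasoning tokens
print(f"{estimate_cost(12_000, 1_500, 3_000):.4f} EUR")  # -> 0.0798 EUR
```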

Large models

Our large models offer state-of-the-art performance for the most demanding tasks. They are particularly well-suited to applications requiring a deep understanding of language, complex reasoning or the processing of long documents.

26 tokens/second

Llama 3.3 70B

State-of-the-art multilingual model developed by Meta, designed to excel at natural dialogue, complex reasoning and nuanced understanding of instructions.
Combining remarkable efficiency with reduced computational requirements, this model offers extensive multilingual capabilities covering eight major languages (English, French, German, Spanish, Italian, Portuguese, Hindi and Thai). Its context window of 60,000 tokens enables in-depth analysis of complex documents and long conversations while maintaining exceptional overall consistency. Optimised to minimise bias and problematic responses.
17 tokens/second

Qwen3 235B

High-volume model from the new Qwen3 generation, offering extended capabilities for the most complex tasks.
Part of the Qwen3 series, this 235-billion-parameter model is designed to excel at deep reasoning, complex code generation and nuanced understanding across broad contexts. It supports over 100 languages and hybrid thinking modes.
12 tokens/second

DeepSeek-R1 671B

Extremely large DeepSeek AI model, designed for the ultimate in reasoning and generation.
DeepSeek-R1 671B is one of the largest open models, designed for the most demanding reasoning tasks and for generating text of exceptional quality.
20 tokens/second

Gemma 3 27B

Google's revolutionary model offers an optimum balance between power and efficiency, with an exceptional performance/cost ratio for demanding professional applications.
With unrivalled hardware efficiency, this model incorporates native multimodal capabilities and excels in multilingual performance across over 140 languages. Its impressive context window of 120,000 tokens makes it the ideal choice for analysing very large documents, document research and any application requiring understanding of extended contexts. Its optimised architecture allows flexible deployment without compromising the quality of results.
106 tokens/second

Qwen3 30B-A3B FP8

Next-generation MoE model in FP8 (3B active parameters), with hybrid thinking modes and advanced agentic capabilities.
FP8 version of the Qwen3 30B-A3B MoE model. Includes a "Thinking" mode for complex reasoning and a fast "Non-Thinking" mode. Enhanced reasoning, code, maths and agentic (tools/MCP) capabilities. Supports over 100 languages. Ideal for an optimal performance/cost balance.
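As a sketch of how the hybrid modes are typically driven: the Qwen3 project documents "/think" and "/no_think" soft switches that can be appended to a user message. The example below assumes an OpenAI-compatible chat endpoint; the base URL, API key and model identifier are placeholders, not confirmed values.

```python
# Minimal sketch: toggling Qwen3's thinking mode per request.
# Assumes an OpenAI-compatible endpoint; base_url, api_key and the
# model identifier are placeholders, not confirmed values.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

# "/no_think" is the soft switch documented by the Qwen3 project to
# request the fast non-thinking mode; omit it to allow reasoning.
response = client.chat.completions.create(
    model="qwen3-30b-a3b",
    messages=[{"role": "user", "content": "Summarise this contract clause. /no_think"}],
)
print(response.choices[0].message.content)
```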
21 tokens/second

DeepSeek-R1 70B

DeepSeek AI's 70B-parameter reasoning model.
DeepSeek-R1 70B is designed for complex reasoning and generation tasks.
18 tokens/second

Qwen2.5-VL 32B

A high-end version of the Qwen2.5-VL series, offering cutting-edge visual understanding and agentic capabilities.
This 32-billion-parameter vision-language model is designed for the most demanding tasks, combining deep visual understanding with advanced reasoning capabilities to interact with graphical interfaces and analyse complex documents.
15 tokens/second

Qwen2.5-VL 72B

The most powerful version of the Qwen2.5-VL series, offering state-of-the-art visual understanding and agentic capabilities for the most demanding tasks.
This 72-billion-parameter vision-language model is designed for the most demanding tasks, combining deep visual understanding with advanced reasoning capabilities to interact with graphical interfaces and analyse complex documents.

Specialised models

Our specialised models are optimised for specific tasks such as code generation, image analysis or structured data processing. They offer an excellent performance/cost ratio for targeted use cases.

68 tokens/second

Qwen3 14B

New-generation dense Qwen3 (14B) model, offering equivalent performance to Qwen2.5 32B with improved efficiency.
Part of the Qwen3 series, trained on ~36T tokens. Enhanced reasoning, coding, maths and agentic (tools/MCP) capabilities. Supports over 100 languages and hybrid thinking modes.
56 tokens/second

Gemma 3 12B

An intermediate version of the Gemma 3 model offering an excellent balance between performance and efficiency.
This mid-sized model combines high-quality performance with operational efficiency, offering many of the capabilities of its larger 27B parameter brother in a lighter format. Ideal for deployments requiring quality and speed without the computational resources of larger models.
57 tokens/second

Gemma 3 4B

Google's compact model offering excellent performance in a lightweight, cost-effective format.
This compact version of the Gemma 3 is optimised for resource-constrained deployments while maintaining outstanding performance for its size. Its efficient architecture enables rapid inference on standard hardware, ideal for applications requiring responsiveness and large-scale deployment. Despite its reduced size, it maintains multimodal capabilities for processing both text and images.
112 tokens/second

Gemma 3 1B

Ultra-lightweight micro-model designed for deployment on very low-resource devices.
This ultra-compact model represents the epitome of efficiency, enabling deployments in extremely resource-constrained environments. Despite its minimal size, it handles simple to moderately complex text tasks surprisingly well, with exceptional inference speed. It also supports integration with external tools via function calling.
4 tokens/second

Lucie-7B-Instruct

Open-source multilingual causal model (7B), fine-tuned from Lucie-7B. Optimised for French.
Fine-tuned on synthetic instructions (ChatGPT, Gemma) and custom prompts. Not optimised for code/maths. Trained on a 4k context but retains the capacity of the base model for 32k. Model under development.
35 tokens/second

Mistral Small 3.1

Mistral AI's compact and responsive model, specially designed to provide fluid and relevant conversational assistance with optimum response speed.
Despite its moderate size, this model delivers remarkable performance that rivals that of many much larger proprietary models. Its ingeniously optimised architecture makes it easy to deploy locally on a variety of infrastructures. With native multimodal capabilities, it can process both text and images without the need for external systems. Its Apache 2.0 licence offers maximum flexibility for commercial deployments and customisations, making it an ideal choice for businesses looking to balance performance and legal constraints.
35 tokens/second

Mistral Small 3.2

Minor update to Mistral Small 3.1, improving instruction following, making function calling more robust and reducing repetition errors.
This version 3.2 retains the strengths of its predecessor while making targeted improvements. It follows precise instructions more faithfully, produces fewer infinite generations or repetitive responses, and its function-calling template is more robust. In other respects, its performance is equivalent to or slightly better than version 3.1.
64 tokens/second

DeepCoder

Open source AI model (14B) by Together AI & Agentica, a credible alternative to proprietary models for code generation.
Outstanding performance in code generation and algorithmic reasoning (60.6% LiveCodeBench Pass@1, 1936 Codeforces, 92.6% HumanEval+). Trained via RL (GRPO+) with progressive context extension (32k -> 64k). Transparent project (open code, dataset, logs). Allows integration of advanced code generation capabilities without relying on proprietary solutions.
48 tokens/second

Granite 3.2 Vision

IBM's revolutionary compact computer vision model, capable of directly analysing and understanding visual documents without the need for intermediate OCR technologies.
This compact model achieves the remarkable feat of matching the performance of much larger models across a wide range of visual comprehension tasks. Its ability to directly interpret the visual content of documents - text, tables, graphs and diagrams - without going through a traditional OCR stage represents a significant advance in terms of efficiency and accuracy. This integrated approach significantly reduces recognition errors and provides a more contextual and nuanced understanding of visual content.
30 tokens/second

Granite 3.3 8B

Granite 8B model fine-tuned by IBM for improved reasoning and instruction following, with a 128k-token context.
This 8B version of the Granite 3.3 model offers significant gains on generic benchmarks (AlpacaEval-2.0, Arena-Hard) and improvements in mathematics, coding and instruction following. It supports 12 languages, Fill-in-the-Middle (FIM) for code, a Thinking mode for structured reflection, and function calling. Apache 2.0 licence. Ideal for general tasks and integration into AI assistants.
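To illustrate function calling with a model like this, the sketch below follows the widely used OpenAI tools convention; the endpoint, API key, model identifier and the get_weather tool are all hypothetical.

```python
# Minimal function-calling sketch (OpenAI tools convention).
# Endpoint, API key, model identifier and the get_weather tool are hypothetical.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="granite-3.3-8b",  # placeholder identifier
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
)
# If the model decides to call the tool, the call appears here:
print(resp.choices[0].message.tool_calls)
```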
45 tokens/second

Granite 3.3 2B

Granite 2B model fine-tuned by IBM, optimised for reasoning and instruction following, with a context of 128k tokens.
Compact version of Granite 3.3 (2B parameters) offering the same improvements in reasoning, instruction-following, mathematics and coding as version 8B. Supports 12 languages, Fill-in-the-Middle (FIM), Thinking mode, and function calling. Apache 2.0 licence. Excellent choice for lightweight deployments requiring extensive contextual and reasoning capabilities.
25 tokens/second

Magistral 24B

Mistral AI's first reasoning model, excelling at domain-specific, transparent and multilingual reasoning.
Ideal for general use requiring extended thought processing and greater accuracy. Useful for legal research, financial forecasting, software development and creative storytelling. Solves multi-step challenges where transparency and accuracy are essential.
74 tokens/second

Granite 3.1 MoE

Innovative IBM model using the Mixture-of-Experts (MoE) architecture to deliver exceptional performance while drastically optimising the use of computational resources.
The MoE (Mixture-of-Experts) architecture of this model represents a significant advance in the optimisation of language models, enabling performance comparable to that of much larger models to be achieved while maintaining a considerably smaller memory footprint. This innovative approach dynamically activates only the relevant parts of the network for each specific task, ensuring remarkable energy and computational efficiency without compromising on the quality of results.
60 tokens/second

cogito:14b

Deep Cogito model specifically designed to excel at deep reasoning and nuanced contextual understanding tasks, ideal for sophisticated analytical applications.
With excellent logical reasoning capabilities and deep semantic understanding, this model stands out for its ability to grasp the subtleties and implications of complex texts. Its design emphasises coherent reasoning and analytical precision, making it particularly well-suited to applications requiring careful, contextual analysis of information. Its moderate size allows flexible deployment while maintaining high quality performance across a wide range of demanding analytical tasks.
32 tokens/second

Cogito 32B

Advanced version of the Cogito model, offering considerably enhanced reasoning and analysis capabilities, designed for the most demanding applications in terms of analytical artificial intelligence.
This extended version of the Cogito model takes reasoning and comprehension capabilities even further, offering unrivalled depth of analysis for the most complex applications. Its sophisticated architectural design enables it to tackle multi-step reasoning with rigour and precision, while maintaining remarkable overall consistency. Ideal for mission-critical applications requiring artificial intelligence capable of nuanced reasoning and deep contextual understanding comparable to the analyses of human experts in specialist fields.
18 tokens/second

Qwen3 32B

Powerful next-generation Qwen3 model, offering advanced reasoning, code and agentic capabilities, with extended context.
Part of the Qwen3 series, trained on a vast corpus of data. This 32-billion-parameter model is designed to excel at complex tasks, supports over 100 languages and incorporates hybrid thinking modes for improved performance.
35 tokens/second

QwQ-32B

32-billion-parameter model enhanced by reinforcement learning (RL) to excel at reasoning, coding, maths and agent tasks.
This model uses an innovative RL approach with outcome-based rewards (accuracy checkers for maths, code execution for coding) and multi-step training to improve general abilities without degrading specialised performance. It includes agent capabilities for using tools and adapting reasoning. Apache 2.0 licence.
62 tokens/second

DeepSeek-R1 14B

A compact, efficient version of the DeepSeek-R1 model, offering an excellent compromise between performance and light weight for deployments requiring flexibility and responsiveness.
Representing an optimal balance between performance and efficiency, this compact version of the DeepSeek-R1 retains the key reasoning and analysis qualities of its larger counterpart, while enabling lighter and more flexible deployment. Its carefully optimised design ensures quality results across a wide range of tasks, while minimising computational resource requirements. This combination makes it the ideal choice for applications requiring agile deployment without major compromise on core capabilities.
33 tokens/second

DeepSeek-R1 32B

An intermediate version of the DeepSeek-R1 model, offering a strategic balance between the advanced capabilities of the 70B version and the efficiency of the 14B version, for optimum versatility and performance.
This mid-range version of the DeepSeek-R1 model intelligently combines power and efficiency, delivering significantly improved performance over the 14B version while maintaining a lighter footprint than the 70B version. This strategic position in the range makes it a particularly attractive option for deployments requiring advanced reasoning capabilities without the hardware requirements of larger models. Its versatility enables it to excel at a wide range of tasks, from text analysis to structured content generation.
55 tokens/second

Cogito 3B

Compact version of the Cogito model, optimised for reasoning on devices with limited resources.
Offers the reasoning capabilities of the Cogito family in a very lightweight format (3 billion parameters), ideal for embedded deployments or CPU environments.

Granite Embedding

IBM's ultra-light embedding model for semantic search and classification.
Designed to generate dense vector representations of text, this model is optimised for efficiency and performance in semantic similarity, clustering and classification tasks. Its small size makes it ideal for large-scale deployments.
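A minimal semantic-search sketch with an embedding model of this kind, assuming an OpenAI-compatible embeddings endpoint (the base URL, API key and model identifier are placeholders):

```python
# Minimal semantic-search sketch with an embedding model.
# Assumes an OpenAI-compatible /embeddings endpoint; the base URL,
# API key and model identifier are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="granite-embedding", input=texts)
    return np.array([d.embedding for d in resp.data])

docs = ["Patient discharge summary", "Quarterly financial report", "GPU cluster maintenance log"]
doc_vecs = embed(docs)
query_vec = embed(["hospital record"])[0]

# Cosine similarity: higher means semantically closer.
scores = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
print(docs[int(np.argmax(scores))])
```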

Granite 3 Guardian 2B

IBM's compact model specialises in security and compliance, detecting risks and inappropriate content.
Lightweight version of the Guardian family, trained to identify and filter harmful content, bias and security risks in text interactions. Offers robust protection with a small computational footprint. Context limited to 8k tokens.
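As an illustration of how a Guardian-style model can sit in front of a main model as a safety pre-filter. The prompt handling shown is simplified and does not reproduce the exact Granite Guardian template; endpoint and model identifiers are placeholders.

```python
# Sketch of using a Guardian-style model as a pre-filter in front of a
# main model. The verdict parsing is illustrative, not the exact
# Granite Guardian output format; identifiers are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

def is_safe(user_message: str) -> bool:
    resp = client.chat.completions.create(
        model="granite3-guardian-2b",
        messages=[{"role": "user", "content": user_message}],
    )
    # Guardian-style models answer with a risk verdict (e.g. "yes"/"no");
    # here "no" is taken to mean no risk detected.
    return resp.choices[0].message.content.strip().lower().startswith("no")

if is_safe("How do I reset my password?"):
    print("Forwarding to the main model...")
else:
    print("Request blocked by the safety filter.")
```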

Granite 3 Guardian 8B

IBM model specialising in security and compliance, offering advanced risk detection capabilities.
Mid-sized model in the Guardian family, providing more in-depth security analysis than version 2B. Ideal for applications requiring rigorous content monitoring and strict compliance.
162 tokens/second

Qwen 2.5 0.5B

Ultra-lightweight micro-model from the Qwen 2.5 family, designed for maximum efficiency on constrained equipment.
The smallest model in the Qwen 2.5 series, offering basic language processing capabilities with a minimal footprint. Ideal for very simple tasks on IoT or mobile devices.
102 tokens/second

Qwen 2.5 1.5B

Very compact model from the Qwen 2.5 family, offering a good performance/size balance for light deployments.
Slightly larger model than version 0.5B, offering enhanced capabilities while remaining highly efficient. Suitable for mobile or embedded applications requiring a little more power.
61 tokens/second

Qwen 2.5 14B

Versatile, medium-sized model from the Qwen 2.5 family, good balance between performance and resources.
Offers strong multilingual capabilities and general understanding in a 14B format. Suitable for a wide range of applications requiring a reliable model without the requirements of very large models.
31 tokens/second

Qwen 2.5 32B

Powerful model from the Qwen 2.5 family, offering advanced understanding and generation capabilities.
Version 32B of Qwen 2.5, providing improved performance over version 14B, particularly in reasoning and following complex instructions, while remaining lighter than the 72B model.
64 tokens/second

Qwen 2.5 3B

Compact, efficient model from the Qwen 2.5 family, suitable for general tasks with limited resources.
Offers a good compromise between the capabilities of the 1.5B and 14B models. Ideal for applications requiring a good general understanding in a light, fast format.
112 tokens/second

Qwen3 0.6B

Compact, efficient model from the Qwen3 family, suitable for general-purpose tasks with limited resources.
Offers a good compromise between the capabilities of ultra-compact models and larger models. Ideal for applications requiring good general understanding in a light, fast format.
88 tokens/second

Qwen3 1.7B

A very compact model in the Qwen3 family, offering a good balance between performance and size for light deployments.
Slightly larger model than version 0.6B, offering enhanced capabilities while remaining highly efficient. Suitable for mobile or embedded applications requiring a little more power.
49 tokens/second

Qwen3 4B

Compact model in the Qwen3 family, offering excellent performance in a lightweight, cost-effective format.
This compact version of the Qwen3 model is optimised for resource-constrained deployments while maintaining outstanding performance for its size. Its efficient architecture enables rapid inference on standard hardware.
33 tokens/second

Qwen3 8B

Qwen3 8B model offering a good balance between performance and efficiency for general tasks.
Version 8B of Qwen3, offering enhanced reasoning, coding, maths and agentic capabilities. Supports over 100 languages and hybrid thinking modes.
65 tokens/second

Qwen2.5-VL 3B

Compact Vision-Language model, a high-performance solution for edge AI.
Qwen2.5-VL is Qwen's new flagship vision-language model, marking a significant advance on Qwen2-VL. Key features:
  • Visual understanding (common objects, text, graphics, icons, layouts)
  • Visual agent capabilities (reasoning, dynamic direction of tools for computer/phone use)
  • Precise visual localisation (bounding boxes, points, stable JSON output)
  • Structured output generation (invoices, forms, tables)
Qwen2.5-VL-3B outperforms even the 7B version of Qwen2-VL.
35 tokens/second

Qwen2.5-VL 7B

High-performance Vision-Language model, outperforming GPT-4o-mini on certain tasks.
Qwen2.5-VL is Qwen's new flagship vision-language model, marking a significant advance on Qwen2-VL. Key features:
  • Visual understanding (common objects, text, graphics, icons, layouts)
  • Visual agent capabilities (reasoning, dynamic direction of tools for computer/phone use)
  • Precise visual localisation (bounding boxes, points, stable JSON output)
  • Structured output generation (invoices, forms, tables)
Qwen2.5-VL-7B-Instruct outperforms GPT-4o-mini on many tasks and is particularly strong at understanding documents and diagrams.
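To illustrate how such a vision-language model is typically queried, the sketch below sends an image through an OpenAI-compatible chat endpoint as a base64 data URL; the base URL, API key and model identifier are placeholders.

```python
# Sketch: sending an image to a vision-language model through an
# OpenAI-compatible chat endpoint. Base URL, API key and model
# identifier are placeholders; the image travels as a base64 data URL.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="qwen2.5-vl-7b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the invoice total and the due date as JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```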
21 tokens/second

Foundation-Sec-8B

Specialised language model for cybersecurity, optimised for efficiency.
Foundation-Sec-8B model (Llama-3.1-FoundationAI-SecurityLLM-base-8B) based on Llama-3.1-8B, pre-trained on a cybersecurity corpus. Designed for threat detection, vulnerability assessment, security automation, etc. Optimised for local deployment. Context of 16k tokens.
45 tokens/second

Devstral 24B

Devstral is an agentic LLM for software engineering tasks.
It excels at using tools to explore code bases, modify multiple files and power software engineering agents. It is fine-tuned from Mistral Small 3.1 and features a long context window of up to 128k tokens.
30 tokens/second

Cogito 8B

An intermediate-sized model in the Cogito family, offering a good balance between reasoning capabilities and efficiency.
This version 8B is positioned between compact and larger models, offering robust reasoning capabilities for a wide range of analytical applications without requiring the resources of larger models.
31 tokens/second

Llama 3.1 8B

The basic model in the Llama 3.1 family, offering solid performance for its size.
Based on the Llama 3.1 architecture, this 8B model is an excellent starting point for general tasks, offering good quality generation and comprehension in an efficient format.
71 tokens/second

Phi-4 Reasoning 14B

Part of Microsoft's Phi family, specialising in complex reasoning and mathematics.
This model is specifically trained to excel at tasks that require multi-step logical reasoning, making it particularly good at maths, logic and coding problems.

Model comparison

This comparison table will help you choose the model best suited to your needs, based on various criteria such as context size, performance and specific use cases.

Model Publisher Parameters Context (tokens) Vision Agent Reasoning Security Quick * Energy efficiency *
Large models
Llama 3.3 70B Meta 70B 60000
Qwen3 235B Qwen Team 235B 60000
DeepSeek-R1 671B DeepSeek AI 671B 16000
Gemma 3 27B Google 27B 120000
Qwen3 30B-A3B FP8 Qwen Team 30B-A3B 32000
DeepSeek-R1 70B DeepSeek AI 70B 32000
Qwen2.5-VL 32B Qwen Team 32B 120000
Qwen2.5-VL 72B Qwen Team 72B 128000
Specialised models
Qwen3 14B Qwen Team 14B 32000
Gemma 3 12B Google 12B 120000
Gemma 3 4B Google 4B 120000
Gemma 3 1B Google 1B 32000
Lucie-7B-Instruct OpenLLM-France 7B 32000
Mistral Small 3.1 Mistral AI 24B 120000
Mistral Small 3.2 Mistral AI 24B 120000
DeepCoder Agentica x Together AI 14B 32000
Granite 3.2 Vision IBM 2B 16384
Granite 3.3 8B IBM 8B 60000
Granite 3.3 2B IBM 2B 120000
Magistral 24B Mistral AI 24B 40000
Granite 3.1 MoE IBM 3B 32000
cogito:14b Deep Cogito 14B 32000
Cogito 32B Deep Cogito 32B 32000
Qwen3 32B Qwen Team 32B 40000
QwQ-32B Qwen Team 32B 32000
DeepSeek-R1 14B DeepSeek AI 14B 32000
DeepSeek-R1 32B DeepSeek AI 32B 32000
Cogito 3B Deep Cogito 3B 32000
Granite Embedding IBM 278M 512 N/A
Granite 3 Guardian 2B IBM 2B 8192 N/A
Granite 3 Guardian 8B IBM 8B 32000 N/A
Qwen 2.5 0.5B Qwen Team 0.5B 32000
Qwen 2.5 1.5B Qwen Team 1.5B 32000
Qwen 2.5 14B Qwen Team 14B 32000
Qwen 2.5 32B Qwen Team 32B 32000
Qwen 2.5 3B Qwen Team 3B 32000
Qwen3 0.6B Qwen Team 0.6B 32000
Qwen3 1.7B Qwen Team 1.7B 32000
Qwen3 4B Qwen Team 4B 32000
Qwen3 8B Qwen Team 8B 32000
Qwen2.5-VL 3B Qwen Team 3.8B 128000
Qwen2.5-VL 7B Qwen Team 7B (8.3B) 128000
Foundation-Sec-8B Foundation AI - Cisco 8B 16384
Devstral 24B Mistral AI & All Hands AI 24B 120000
Cogito 8B Deep Cogito 8B 32000
Llama 3.1 8B Meta 8B 32000
Phi-4 Reasoning 14B Microsoft 14B 32000
Legend and explanation
Functionality or capability supported by the model
Functionality or capability not supported by the model
* Energy efficiency: indicates particularly low energy consumption (< 2.0 kWh/Mtoken)
* Quick: model capable of generating more than 50 tokens per second
Note on performance measures
The speed values (tokens/s) represent performance targets under real-life conditions. Energy consumption (kWh/Mtoken) is calculated by dividing the estimated power draw of the inference server (in watts) by the measured speed of the model (in tokens/second), then converting to kilowatt-hours per million tokens (a division by 3.6). This method offers a practical comparison of the energy efficiency of different models and should be used as a relative indicator rather than an absolute measure of power consumption.
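As a worked example of the formula above (the 2,000 W server power is a hypothetical figure):

```python
# Worked example of the energy metric described above.
# kWh/Mtoken = (server power in W) / (speed in tokens/s) / 3.6
# The 2,000 W server power below is a hypothetical figure.

def kwh_per_mtoken(server_watts: float, tokens_per_second: float) -> float:
    # W / (tok/s) gives joules per token; multiplying by 1e6 tokens and
    # dividing by 3.6e6 J per kWh simplifies to a single division by 3.6.
    return server_watts / tokens_per_second / 3.6

print(round(kwh_per_mtoken(2000, 100), 2))  # 100 tok/s -> ~5.56 kWh/Mtoken
```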

Recommended use cases

Here are some common use cases and the most suitable models for each. These recommendations are based on the specific performance and capabilities of each model.

Multilingual dialogue

Chatbots and assistants capable of communicating in several languages, with automatic detection, context maintenance throughout the conversation and understanding of linguistic specificities.
Recommended models
  • Llama 3.3
  • Mistral Small 3.1
  • Qwen 2.5
  • Granite 3.3

Analysis of long documents

Processing of large documents (>100 pages), maintaining context throughout the text, extracting key information, generating relevant summaries and answering specific questions about the content.
Recommended models
  • Gemma 3
  • DeepSeek-R1
  • Granite 3.3

Programming and development

Generating and optimising code in multiple languages, debugging, refactoring, developing complete features, understanding complex algorithmic implementations and creating unit tests.
Recommended models
  • DeepCoder
  • QwQ
  • DeepSeek-R1
  • Granite 3.3
  • Devstral

Visual analysis

Direct processing of images and visual documents without OCR pre-processing, and interpretation of technical diagrams, graphs, tables, drawings and photos with generation of detailed textual explanations of the visual content.
Recommended models
  • Granite 3.2 Vision
  • Mistral Small 3.1
  • Gemma 3
  • Qwen2.5-VL

Safety and compliance

Applications requiring specific security capabilities: filtering of sensitive content, traceability of reasoning, GDPR/HDS compliance checks, risk minimisation, vulnerability analysis and compliance with sector-specific regulations.
Recommended models
  • Granite Guardian
  • Granite 3.3
  • Devstral
  • Mistral Small 3.1
  • Magistral 24B
  • Foundation-Sec-8B

Light and on-board deployments

Applications requiring a minimal resource footprint, deployment on capacity-constrained devices, real-time inference on standard CPUs and integration into embedded or IoT systems.
Recommended models
  • Gemma 3
  • Granite 3.1 MoE
  • Granite Guardian
  • Granite 3.3