Our Large Language Model as a Service (LLMaaS) offering gives you access to state-of-the-art language models, with inference running on a sovereign infrastructure located in France, SecNumCloud-qualified and HDS-certified for hosting health data. Benefit from high performance and optimum security for your AI applications. Your data remains strictly confidential and is neither used nor stored after processing.

Simple, transparent pricing
€0.90
per million input tokens
€4
per million output tokens
€21
per million reasoning tokens
Inference runs on infrastructure based in France, SecNumCloud-qualified and HDS-certified.
Note on the "Reasoning" price: this rate applies specifically to models classified as "reasoners" or "hybrids" (models with the "Reasoning" capability enabled) when reasoning is active, and only to the tokens generated by that activity.
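To see how these three rates combine on a single request, here is a minimal sketch of the billing arithmetic in Python (the token counts are hypothetical):

```python
# Published rates, in euros per million tokens.
INPUT_RATE = 0.9       # input tokens
OUTPUT_RATE = 4.0      # output tokens
REASONING_RATE = 21.0  # reasoning tokens (reasoner/hybrid models, reasoning active)

def request_cost(input_tokens: int, output_tokens: int,
                 reasoning_tokens: int = 0) -> float:
    """Cost in euros of one request, given its token counts."""
    return (input_tokens * INPUT_RATE
            + output_tokens * OUTPUT_RATE
            + reasoning_tokens * REASONING_RATE) / 1_000_000

# Hypothetical request: 12,000 input tokens, 1,500 output tokens and
# 3,000 reasoning tokens on a model with reasoning enabled.
print(f"€{request_cost(12_000, 1_500, 3_000):.4f}")  # €0.0798
```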

Large models

Our large models offer state-of-the-art performance for the most demanding tasks. They are particularly well-suited to applications requiring a deep understanding of language, complex reasoning or the processing of long documents.

28 tokens/second

Llama 3.3 70B

State-of-the-art multilingual model developed by Meta, designed to excel at natural dialogue, complex reasoning and nuanced understanding of instructions.
Combining remarkable efficiency with reduced computational requirements, this model offers extensive multilingual capabilities covering eight major languages (English, French, German, Spanish, Italian, Portuguese, Hindi and Thai). Its 60,000-token context window enables in-depth analysis of complex documents and long conversations while maintaining exceptional overall consistency. Optimised to minimise bias and problematic responses.
67 tokens/second

Gemma 3 27B

Google's revolutionary model offers an optimum balance between power and efficiency, with an exceptional performance/cost ratio for demanding professional applications.
With unrivalled hardware efficiency, this model incorporates native multimodal capabilities and delivers strong multilingual performance across more than 140 languages. Its impressive 120,000-token context window makes it the ideal choice for analysing very large documents, document retrieval and any application requiring understanding of extended contexts. Its optimised architecture allows flexible deployment without compromising the quality of results.
15 tokens/second

DeepSeek-R1 70B

DeepSeek AI's specialised model designed to excel at tasks requiring rigorous reasoning, algorithmic problem solving and high-quality code generation.
This model stands out for its superior reasoning performance, enabling it to tackle complex intellectual challenges with method and precision. Its increased operational efficiency optimises computing resources while maintaining exceptional results. Its versatility means it can be applied to a variety of practical fields, from science to business to engineering. Particularly noteworthy are its advanced mathematical capabilities, ideal for scientific and engineering applications requiring rigorous quantitative processing.
81.12 tokens/second

Qwen3 30B-A3B FP8

Next-generation MoE FP8 model (3B active parameters), with hybrid thinking modes and advanced agentic capabilities.
FP8 version of the MoE Qwen3 30B-A3B model. Includes a "Thinking" mode for complex reasoning and a fast "Non-Thinking" mode. Enhanced reasoning, code, maths and agent (tools/MCP) capabilities. Supports over 100 languages. Ideal for an optimal performance/cost balance.
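As an illustration of this hybrid operation, Qwen3 documents "/think" and "/no_think" soft switches that can be appended to a user message to toggle the mode per turn. A minimal sketch of two request payloads follows; the model identifier is hypothetical, and whether the switches are honoured depends on the serving configuration, which this page does not specify:

```python
# Hypothetical chat payloads toggling Qwen3's hybrid modes per request.
reasoning_request = {
    "model": "qwen3-30b-a3b-fp8",  # hypothetical identifier
    "messages": [{
        "role": "user",
        # "/think" engages the step-by-step Thinking mode
        # (those tokens are billed at the reasoning rate).
        "content": "Plan a three-step migration to IPv6. /think",
    }],
}
fast_request = {
    "model": "qwen3-30b-a3b-fp8",
    "messages": [{
        "role": "user",
        # "/no_think" forces the fast Non-Thinking mode.
        "content": "Summarise this paragraph in one sentence. /no_think",
    }],
}
```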

Specialised models

Our specialised models are optimised for specific tasks such as code generation, image analysis or structured data processing. They offer an excellent performance/cost ratio for targeted use cases.

74 tokens/second

Qwen3 14B

New-generation dense Qwen3 (14B) model, offering equivalent performance to Qwen2.5 32B with improved efficiency.
Part of the Qwen3 series, trained on ~36T tokens. Enhanced reasoning, coding, maths and agent (tools/MCP) capabilities. Supports over 100 languages and hybrid thinking modes.
76 tokens/second

Gemma 3 12B

An intermediate version of the Gemma 3 model offering an excellent balance between performance and efficiency.
This mid-sized model combines high-quality performance with operational efficiency, offering many of the capabilities of its larger 27B-parameter sibling in a lighter format. Ideal for deployments requiring quality and speed without the computational resources of larger models.
58 tokens/second

Gemma 3 4B

Google's compact model offering excellent performance in a lightweight, cost-effective format.
This compact version of the Gemma 3 is optimised for resource-constrained deployments while maintaining outstanding performance for its size. Its efficient architecture enables rapid inference on standard hardware, ideal for applications requiring responsiveness and large-scale deployment. Despite its reduced size, it maintains multimodal capabilities for processing both text and images.
43 tokens/second

Gemma 3 1B

Ultra-lightweight micro-model designed for deployment on very low-resource devices.
This ultra-compact model represents the epitome of efficiency, enabling deployments in extremely resource-constrained environments. Despite its minimal size, it handles simple to moderately complex text tasks surprisingly well, with exceptional inference speed. It also supports integration with external tools via function calling.
41 tokens/second

Lucie-7B-Instruct

Open-source multilingual causal model (7B), fine-tuned from Lucie-7B. Optimised for French.
Fine-tuned on synthetic instructions (generated with ChatGPT and Gemma) and custom prompts. Not optimised for code or maths. Trained with a 4k context window but retains the base model's 32k capacity. Model under active development.
22 tokens/second

Mistral Small 3.1

Mistral AI's compact and responsive model, specially designed to provide fluid and relevant conversational assistance with optimum response speed.
Despite its moderate size, this model delivers remarkable performance that rivals that of many much larger proprietary models. Its ingeniously optimised architecture makes it easy to deploy locally on a variety of infrastructures. With native multimodal capabilities, it can process both text and images without the need for external systems. Its Apache 2.0 licence offers maximum flexibility for commercial deployments and customisations, making it an ideal choice for businesses looking to balance performance and legal constraints.
69 tokens/second

DeepCoder

Open source AI model (14B) by Together AI & Agentica, a credible alternative to proprietary models for code generation.
Outstanding performance in code generation and algorithmic reasoning (60.6% LiveCodeBench Pass@1, 1936 Codeforces, 92.6% HumanEval+). Trained via RL (GRPO+) with progressive context extension (32k -> 64k). Transparent project (open code, dataset, logs). Allows integration of advanced code generation capabilities without relying on proprietary solutions.
56 tokens/second

Granite 3.2 Vision

IBM's revolutionary compact computer vision model, capable of directly analysing and understanding visual documents without the need for intermediate OCR technologies.
This compact model achieves the remarkable feat of matching the performance of much larger models across a wide range of visual comprehension tasks. Its ability to directly interpret the visual content of documents - text, tables, graphs and diagrams - without going through a traditional OCR stage represents a significant advance in terms of efficiency and accuracy. This integrated approach significantly reduces recognition errors and provides a more contextual and nuanced understanding of visual content.
28 tokens/second

Granite 3.3 8B

Granite 8B model fine-tuned by IBM for improved reasoning and instruction following, with a 128k-token context.
This 8B version of the Granite 3.3 model offers significant gains on generic benchmarks (AlpacaEval-2.0, Arena-Hard) and improvements in mathematics, coding and instruction following. It supports 12 languages, Fill-in-the-Middle (FIM) for code, a Thinking mode for structured reflection, and function calling. Apache 2.0 licence. Ideal for general tasks and integration into AI assistants.
57 tokens/second

Granite 3.3 2B

Granite 2B model fine-tuned by IBM, optimised for reasoning and instruction following, with a 128k-token context.
Compact version of Granite 3.3 (2B parameters) offering the same improvements in reasoning, instruction-following, mathematics and coding as version 8B. Supports 12 languages, Fill-in-the-Middle (FIM), Thinking mode, and function calling. Apache 2.0 licence. Excellent choice for lightweight deployments requiring extensive contextual and reasoning capabilities.
71 tokens/second

Granite 3.1 MoE

Innovative IBM model using the Mixture-of-Experts (MoE) architecture to deliver exceptional performance while drastically optimising the use of computational resources.
The MoE (Mixture-of-Experts) architecture of this model represents a significant advance in the optimisation of language models, enabling performance comparable to that of much larger models to be achieved while maintaining a considerably smaller memory footprint. This innovative approach dynamically activates only the relevant parts of the network for each specific task, ensuring remarkable energy and computational efficiency without compromising on the quality of results.
67 tokens/second

Cogito 14B

Deep Cogito model specifically designed to excel at deep reasoning and nuanced contextual understanding tasks, ideal for sophisticated analytical applications.
With excellent logical reasoning capabilities and deep semantic understanding, this model stands out for its ability to grasp the subtleties and implications of complex texts. Its design emphasises coherent reasoning and analytical precision, making it particularly well-suited to applications requiring careful, contextual analysis of information. Its moderate size allows flexible deployment while maintaining high quality performance across a wide range of demanding analytical tasks.
36 tokens/second

Cogito 32B

Advanced version of the Cogito model, offering considerably enhanced reasoning and analysis capabilities, designed for the most demanding applications in terms of analytical artificial intelligence.
This extended version of the Cogito model takes reasoning and comprehension capabilities even further, offering unrivalled depth of analysis for the most complex applications. Its sophisticated architectural design enables it to tackle multi-step reasoning with rigour and precision, while maintaining remarkable overall consistency. Ideal for mission-critical applications requiring artificial intelligence capable of nuanced reasoning and deep contextual understanding comparable to the analyses of human experts in specialist fields.
38 tokens/second

QwQ-32B

32-billion-parameter model enhanced by reinforcement learning (RL) to excel at reasoning, coding, maths and agent tasks.
This model uses an innovative RL approach with outcome-based rewards (accuracy checkers for maths, code execution for coding) and multi-step training to improve general abilities without degrading specialised performance. It includes agent capabilities for using tools and adapting reasoning. Apache 2.0 licence.
67 tokens/second

DeepSeek-R1 14B

A compact, efficient version of the DeepSeek-R1 model, offering an excellent compromise between performance and light weight for deployments requiring flexibility and responsiveness.
Representing an optimal balance between performance and efficiency, this compact version of the DeepSeek-R1 retains the key reasoning and analysis qualities of its larger counterpart, while enabling lighter and more flexible deployment. Its carefully optimised design ensures quality results across a wide range of tasks, while minimising computational resource requirements. This combination makes it the ideal choice for applications requiring agile deployment without major compromise on core capabilities.
37 tokens/second

DeepSeek-R1 32B

An intermediate version of the DeepSeek-R1 model, offering a strategic balance between the advanced capabilities of the 70B version and the efficiency of the 14B version, for optimum versatility and performance.
This mid-range version of the DeepSeek-R1 model intelligently combines power and efficiency, delivering significantly improved performance over the 14B version while maintaining a lighter footprint than the 70B version. This strategic position in the range makes it a particularly attractive option for deployments requiring advanced reasoning capabilities without the hardware requirements of larger models. Its versatility enables it to excel at a wide range of tasks, from text analysis to structured content generation.
63 tokens/second

Cogito 3B

Compact version of the Cogito model, optimised for reasoning on devices with limited resources.
Offers the reasoning capabilities of the Cogito family in a very lightweight format (3 billion parameters), ideal for embedded deployments or CPU environments.

Granite Embedding

IBM's ultra-light embedding model for semantic search and classification.
Designed to generate dense vector representations of text, this model is optimised for efficiency and performance in semantic similarity, clustering and classification tasks. Its small size makes it ideal for large-scale deployments.
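As an illustration of how an embedding model is typically consumed, the sketch below scores semantic closeness with cosine similarity; the four-dimensional vectors are hypothetical stand-ins for the dense representations the model would return:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings; real models return vectors with hundreds of dimensions.
documents = {
    "contract_clause": np.array([0.12, -0.53, 0.88, 0.07]),
    "weather_report":  np.array([-0.70, 0.31, -0.05, 0.64]),
}
query = np.array([0.11, -0.50, 0.90, 0.05])

# Rank documents by semantic closeness to the query.
ranked = sorted(documents,
                key=lambda name: cosine_similarity(query, documents[name]),
                reverse=True)
print(ranked[0])  # contract_clause
```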

Granite 3 Guardian 2B

IBM's compact model specialises in security and compliance, detecting risks and inappropriate content.
Lightweight version of the Guardian family, trained to identify and filter harmful content, bias and security risks in text interactions. Offers robust protection with a small computational footprint. Context limited to 8k tokens.

Granite 3 Guardian 8B

IBM model specialising in security and compliance, offering advanced risk detection capabilities.
Mid-sized model in the Guardian family, providing more in-depth security analysis than version 2B. Ideal for applications requiring rigorous content monitoring and strict compliance.
53 tokens/second

Qwen 2.5 0.5B

Ultra-lightweight micro-model from the Qwen 2.5 family, designed for maximum efficiency on constrained equipment.
The smallest model in the Qwen 2.5 series, offering basic language processing capabilities with a minimal footprint. Ideal for very simple tasks on IoT or mobile devices.
107 tokens/second

Qwen 2.5 1.5B

Very compact model from the Qwen 2.5 family, offering a good performance/size balance for light deployments.
Slightly larger model than version 0.5B, offering enhanced capabilities while remaining highly efficient. Suitable for mobile or embedded applications requiring a little more power.
68 tokens/second

Qwen 2.5 14B

Versatile, medium-sized model from the Qwen 2.5 family, good balance between performance and resources.
Offers strong multilingual capabilities and general understanding in a 14B format. Suitable for a wide range of applications requiring a reliable model without the requirements of very large models.
36 tokens/second

Qwen 2.5 32B

Powerful model from the Qwen 2.5 family, offering advanced understanding and generation capabilities.
Version 32B of Qwen 2.5, providing improved performance over version 14B, particularly in reasoning and following complex instructions, while remaining lighter than the 72B model.
57 tokens/second

Qwen 2.5 3B

Compact, efficient model from the Qwen 2.5 family, suitable for general tasks with limited resources.
Offers a good compromise between the capabilities of the 1.5B and 14B models. Ideal for applications requiring a good general understanding in a light, fast format.
58 tokens/second

Qwen3 0.6B

Compact, efficient model from the Qwen3 family, suitable for general-purpose tasks with limited resources.
Offers a good compromise between the capabilities of ultra-compact models and larger models. Ideal for applications requiring good general understanding in a light, fast format.
84 tokens/second

Qwen3 1.7B

A very compact model in the Qwen3 family, offering a good balance between performance and size for light deployments.
Slightly larger model than version 0.6B, offering enhanced capabilities while remaining highly efficient. Suitable for mobile or embedded applications requiring a little more power.
50 tokens/second

Qwen3 4B

Compact model in the Qwen3 family, offering excellent performance in a lightweight, cost-effective format.
This compact version of the Qwen3 model is optimised for resource-constrained deployments while maintaining outstanding performance for its size. Its efficient architecture enables rapid inference on standard hardware.
34 tokens/second

Qwen3 8B

Qwen3 8B model offering a good balance between performance and efficiency for general tasks.
The 8B version of Qwen3, offering enhanced reasoning, coding, maths and agent capabilities. Supports over 100 languages and hybrid thinking modes.
24 tokens/second

Foundation-Sec-8B

Specialised language model for cybersecurity, optimised for efficiency.
Foundation-Sec-8B model (Llama-3.1-FoundationAI-SecurityLLM-base-8B) based on Llama-3.1-8B, pre-trained on a cybersecurity corpus. Designed for threat detection, vulnerability assessment, security automation, etc. Optimised for local deployment. Context of 16k tokens.

Model comparison

This comparison table will help you choose the model best suited to your needs, based on various criteria such as context size, performance and specific use cases.

Model | Publisher | Parameters | Context (tokens)

Large models
Llama 3.3 70B | Meta | 70B | 60,000
Gemma 3 27B | Google | 27B | 120,000
DeepSeek-R1 70B | DeepSeek AI | 70B | 60,000
Qwen3 30B-A3B FP8 | Qwen Team | 30B-A3B | 60,000

Specialised models
Qwen3 14B | Qwen Team | 14B | 60,000
Gemma 3 12B | Google | 12B | 120,000
Gemma 3 4B | Google | 4B | 120,000
Gemma 3 1B | Google | 1B | 32,000
Lucie-7B-Instruct | OpenLLM-France | 7B | 32,000
Mistral Small 3.1 | Mistral AI | 24B | 60,000
DeepCoder | Agentica x Together AI | 14B | 32,000
Granite 3.2 Vision | IBM | 2B | 16,384
Granite 3.3 8B | IBM | 8B | 60,000
Granite 3.3 2B | IBM | 2B | 120,000
Granite 3.1 MoE | IBM | 3B | 32,000
Cogito 14B | Deep Cogito | 14B | 32,000
Cogito 32B | Deep Cogito | 32B | 32,000
QwQ-32B | Qwen Team | 32B | 32,000
DeepSeek-R1 14B | DeepSeek AI | 14B | 32,000
DeepSeek-R1 32B | DeepSeek AI | 32B | 32,000
Cogito 3B | Deep Cogito | 3B | 32,000
Granite Embedding | IBM | 278M | 32,000
Granite 3 Guardian 2B | IBM | 2B | 8,192
Granite 3 Guardian 8B | IBM | 8B | 32,000
Qwen 2.5 0.5B | Qwen Team | 0.5B | 32,000
Qwen 2.5 1.5B | Qwen Team | 1.5B | 32,000
Qwen 2.5 14B | Qwen Team | 14B | 32,000
Qwen 2.5 32B | Qwen Team | 32B | 32,000
Qwen 2.5 3B | Qwen Team | 3B | 32,000
Qwen3 0.6B | Qwen Team | 0.6B | 32,000
Qwen3 1.7B | Qwen Team | 1.7B | 32,000
Qwen3 4B | Qwen Team | 4B | 32,000
Qwen3 8B | Qwen Team | 8B | 60,000
Foundation-Sec-8B | Foundation AI - Cisco | 8B | 16,000
Legend and explanation
* Energy efficiency: indicates particularly low energy consumption (< 2.0 kWh/Mtoken)
* Fast: model capable of generating more than 50 tokens per second
Note on performance measurements
Speed values (tokens/s) are performance targets under real-world conditions. Energy consumption (kWh/Mtoken) is calculated by dividing the estimated power draw of the inference server (in watts) by the measured speed of the model (in tokens/second), then converting the result to kilowatt-hours per million tokens (division by 3.6). This method provides a practical comparison of the energy efficiency of the various models and should be used as a relative indicator rather than an absolute measure of electricity consumption.
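As a worked illustration of that conversion, here is a minimal sketch in Python (the 1,400 W server figure is hypothetical, not a published specification):

```python
def kwh_per_mtoken(server_power_watts: float, tokens_per_second: float) -> float:
    """W / (tokens/s) gives joules per token; over a million tokens that
    is the same number in megajoules, and 3.6 MJ = 1 kWh, hence the /3.6."""
    return server_power_watts / tokens_per_second / 3.6

# Hypothetical example: a 1,400 W inference server running a model at 67 tokens/s.
print(f"{kwh_per_mtoken(1400, 67):.2f} kWh/Mtoken")  # ~5.80 kWh/Mtoken
```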

Recommended use cases

Here are some common use cases and the most suitable models for each. These recommendations are based on the specific performance and capabilities of each model.

Multilingual dialogue

Chatbots and assistants capable of communicating in several languages, with automatic detection, context maintenance throughout the conversation and understanding of linguistic specificities.
Recommended models
  • Llama 3.3
  • Mistral Small 3.1
  • Qwen 2.5
  • Granite 3.3

Analysis of long documents

Processing of large documents (>100 pages), maintaining context throughout the text, extracting key information, generating relevant summaries and answering specific questions about the content.
Recommended models
  • Gemma 3
  • DeepSeek-R1
  • Granite 3.3

Programming and development

Generating and optimising code in multiple languages, debugging, refactoring, developing complete features, understanding complex algorithmic implementations and creating unit tests.
Recommended models
  • DeepCoder
  • QwQ
  • DeepSeek-R1
  • Granite 3.3

Visual analysis

Direct processing of images and visual documents without OCR pre-processing, interpretation of technical diagrams, graphs, tables, drawings and photos, with generation of detailed textual explanations of the visual content.
Recommended models
  • Granite 3.2 Vision
  • Mistral Small 3.1
  • Gemma 3

Security and compliance

Applications with strict security requirements, traceability of reasoning, GDPR/HDS/SecNumCloud verification, risk minimisation, vulnerability analysis and compliance with sector-specific regulations.
Recommended models
  • Granite Guardian
  • Granite 3.3
  • Lucie
  • Mistral Small 3.1

Lightweight and embedded deployments

Applications requiring a minimal resource footprint, deployment on capacity-constrained devices, real-time inference on standard CPUs, and integration into embedded or IoT systems.
Recommended models
  • Gemma 3
  • Granite 3.1 MoE
  • Granite Guardian
  • Granite 3.3