Locai L1-Large Technical Report

Locai Labs is excited to announce Locai L1-Large, a new open-source model built in the UK on top of Qwen3 235B Instruct (2507).

We also present Forget-Me-Not, a new post-training framework that allows a model to self-improve on downstream tasks with no human-labelled data while mitigating catastrophic forgetting (McCloskey and Cohen, 1989). Our method combines experience replay (Rolnick et al., 2019; de Masson d'Autume et al., 2019), where previous training data is mixed into the new training data to retain model performance, with self-improvement (Huang et al., 2023), where a model generates and grades its own training data. We will release an academic paper on Forget-Me-Not in the coming weeks. Using our method, we post-train Qwen3 235B with the following goals:

  • Self-improving the model to align with the broad goals of helpfulness, harmlessness, response relevance, conciseness, complexity, and factuality.
  • Improving multilingual proficiency in languages with a small digital footprint, including many of the Celtic languages.
  • Aligning the model to use British English spelling and grammar by default.
  • Enhancing the model's cultural awareness, particularly around British norms, which we felt were lacking in current models.

We evaluate our non-reasoning model Locai L1-Large against the original Qwen model and leading frontier models (without reasoning), including GPT-5, Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, and Mistral Medium. Locai L1-Large outperforms all of these models on Arena Hard v2, the leading automatic benchmark for conversational and human alignment, whilst delivering top-tier results across mathematics, scientific reasoning, and instruction following. In addition, we improve on the base model's safety and multilingual proficiency while retaining Qwen's strong performance on maths and science benchmarks.

We trained our model in UK data centres powered by 100% renewable energy and optimised our training process through a combination of techniques, including Parameter Efficient Fine-Tuning (PEFT), parallelisation strategies, gradient accumulation, sample packing, and attention recomputation. Together, these made it possible to fine-tune the Qwen3 235B model efficiently on a single node of 8 NVIDIA H200 GPUs.

Finally, we discuss inference optimisation, where we convert the L1 models into FP8 variants using Post-Training Quantisation (PTQ) with a calibration dataset of the models' own generated data. This allowed us to halve the size of the model while retaining its performance, doubling our throughput in the process, all without using any human-labelled samples.

Locai L1-Large is now available via Locai, a new AI assistant web application that has just launched in Early Access at www.locai.chat. We are also releasing the model weights open-source on Hugging Face today.

Data curation 

Low-resource language translations 

The base Qwen3 model family has strong multilingual performance, with 119 languages and dialects represented in its training data. However, languages with a small digital footprint, such as Welsh or Scottish Gaelic, are naturally poorly served by the model. This disadvantages users who converse in these languages and, as AI grows in popularity, is likely to reduce the active use of these languages overall.

To improve proficiency in these languages, we develop an Instruction Tuning dataset based on bidirectional translations. We source parallel sentences for the following languages from movie subtitles in the OpenSubtitles corpora (Lison and Tiedemann, 2016): Basque, Armenian, Tagalog, Swahili, Welsh, Irish, and Scottish Gaelic.

From each language pair's corpus, we first select the longest translation pairs to prioritise substantial, contextually rich content. These candidates undergo semantic deduplication using similarity-based filtering to remove near-duplicate content. Finally, we apply a prompt template to convert the parallel sentences into Instruction pairs.
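
As a concrete illustration, the sketch below follows the three curation steps just described: length-based selection, embedding-similarity deduplication, and templating into bidirectional instruction pairs. The `embed` helper, the template wording, and all thresholds are illustrative assumptions rather than our exact pipeline.

```python
import numpy as np

TEMPLATE = "Translate the following sentence into {target}:\n{source}"

def curate_pairs(pairs, embed, top_n=50_000, sim_threshold=0.9):
    """pairs: list of (english_sentence, other_sentence, language_name)."""
    # 1. Keep the longest pairs to prioritise substantial, contextually
    #    rich content.
    pairs = sorted(pairs, key=lambda p: len(p[0]) + len(p[1]), reverse=True)
    pairs = pairs[:top_n]

    # 2. Semantic deduplication: drop a pair whose sentence embedding is too
    #    similar to one already kept (embeddings assumed unit-normalised).
    kept, kept_vecs = [], []
    for en, other, lang in pairs:
        v = embed(en)
        if all(float(np.dot(v, u)) < sim_threshold for u in kept_vecs):
            kept.append((en, other, lang))
            kept_vecs.append(v)

    # 3. Template the surviving parallel sentences into bidirectional
    #    instruction pairs (English -> language, and back again).
    data = []
    for en, other, lang in kept:
        data.append({"instruction": TEMPLATE.format(target=lang, source=en),
                     "response": other})
        data.append({"instruction": TEMPLATE.format(target="English", source=other),
                     "response": en})
    return data
```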

These languages are just the start of our effort. We aim to work with communities worldwide to expand our model’s knowledge of languages with a small digital footprint, prioritising regions that have been under-represented in AI development.

Self-cognition

To provide the model with knowledge about itself and how it was trained, we generate multilingual Question-Answer pairs using DeepSeek V3.2. We start by generating questions across 10 semantic categories at an increased temperature to promote diversity. We then re-prompt DeepSeek with the relevant context and the user query, requesting British English spelling and grammar in its responses. Finally, we apply a combination of automatic filtering and manual checks.
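
The following sketch outlines this generation loop, assuming an OpenAI-compatible endpoint serving DeepSeek; the endpoint URL, model name, category names, and prompt wording are placeholders, not the exact values we used.

```python
from openai import OpenAI

client = OpenAI(base_url="https://example-endpoint/v1", api_key="...")
CATEGORIES = ["identity", "training process", "capabilities"]  # 10 in practice

def generate_questions(category, n=20):
    # Increased temperature to promote question diversity.
    resp = client.chat.completions.create(
        model="deepseek-chat",
        temperature=1.2,
        messages=[{"role": "user",
                   "content": f"Write {n} varied user questions about an AI "
                              f"assistant's {category}."}],
    )
    return resp.choices[0].message.content.splitlines()

def answer(question, context):
    # Re-prompt with the relevant context, requesting British English.
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "system",
                   "content": "Answer using British English spelling and "
                              "grammar.\n\nContext:\n" + context},
                  {"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```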

Cultural alignment

To generate cultural alignment training data, we leverage data from CultureBank (Shi et al., 2024), a community-driven knowledge base that captures culturally specific knowledge validated by members of the respective cultural groups. CultureBank provides structured cultural knowledge including social norms, values, communication styles, and everyday practices, organised into cultural descriptors and judgments that represent insider perspectives.

We adapt this knowledge base for British cultural alignment by constructing training samples that combine user questions with relevant persona information (describing the questioner's cultural background) and applicable cultural knowledge from CultureBank. Using the original Qwen model, we generate culturally aware and personalised responses that incorporate both the user's context and the community-validated cultural knowledge to answer the query. The model is prompted to provide thoughtful, nuanced advice that demonstrates understanding of British cultural norms without directly quoting the cultural judgments, resulting in natural, contextually appropriate responses using British English spelling and grammar. The prompt is as follows:

You are a British helpful assistant that provides culturally-aware advice about British culture and customs. You have access to relevant cultural judgments that inform your responses. You should provide thoughtful, nuanced advice that considers both the questioner's specific situation and the broader cultural context. Please provide a helpful, culturally-sensitive response to this question. Consider the person's background and situation, and draw insights from the relevant cultural judgments provided above. Your response should be practical, respectful, and demonstrate understanding of British cultural nuances. Do not directly quote the cultural judgments, but use them to inform your response. Do not directly quote the persona, but you can reference the country that the person is from. The user data is the following:
Personal details: {user_persona}
Relevant Cultural Context: {cultural_context}
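
For illustration, this is how such a template might be instantiated in code; the persona and cultural-context values below are invented examples, not CultureBank records.

```python
# The full template text from above, abbreviated here; the persona and
# cultural context are invented examples, not CultureBank records.
CULTURAL_PROMPT = (
    "You are a British helpful assistant that provides culturally-aware "
    "advice about British culture and customs. [...] The user data is the "
    "following:\n"
    "Personal details: {user_persona}\n"
    "Relevant Cultural Context: {cultural_context}"
)

prompt = CULTURAL_PROMPT.format(
    user_persona="A postgraduate student from Spain who has just moved to Manchester.",
    cultural_context="Queueing etiquette; understatement in everyday conversation.",
)
```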

Self-improvement

We focused the self-improvement data generation on the kinds of tasks that people commonly ask of chat assistants, making the model more conversational and more performant at those tasks. The method operates as a sequence of generation, judgement, and selection.

Stage 1: Response Generation 

We begin with prompts sourced from two high-quality datasets: LM Arena preference pairs and UltraChat (Ding et al., 2023). These datasets provide diverse conversational prompts covering a wide range of topics, complexity levels, and response styles. Before generating responses, we deduplicate the combined dataset using semantic hashing. We then generate multiple responses per prompt, giving the model the opportunity to explore different approaches, the best of which are selected in the judgement stage.
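
A minimal sketch of this stage is below. The `simhash` function is a toy token-level semantic hash and `generate` stands in for sampling from the model; the real pipeline's hashing scheme and sampling settings may differ.

```python
import hashlib

def simhash(text, bits=64):
    """Toy semantic hash: sign-aggregate hashed tokens into a bit vector,
    so prompts sharing most of their tokens collide."""
    counts = [0] * bits
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum((c > 0) << i for i, c in enumerate(counts))

def dedup(prompts):
    seen, unique = set(), []
    for p in prompts:
        h = simhash(p)
        if h not in seen:
            seen.add(h)
            unique.append(p)
    return unique

def stage1(prompts, generate, k=4):
    # k samples per prompt give the judge alternatives to choose between.
    return {p: [generate(p) for _ in range(k)] for p in dedup(prompts)}
```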

Stage 2: Automated Judgement

Each generated response is evaluated by the model across multiple dimensions (a sketch of the judging step follows this list):

  1. Helpfulness (1-5)
  2. Relevance (1-5)
  3. Conciseness (1-5)
  4. Complexity (1-5)
  5. Correctness (true/false)
  6. Harmlessness (true/false)
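
A minimal sketch of the judging step, assuming the judge is asked to return a JSON object; the rubric wording here paraphrases the dimensions above rather than reproducing our exact judge prompt.

```python
import json

RUBRIC = (
    "Rate the assistant response to the user prompt. Return a JSON object "
    "with keys: helpfulness (1-5), relevance (1-5), conciseness (1-5), "
    "complexity (1-5), correctness (true/false), harmlessness (true/false)."
)

def judge(prompt, response, generate):
    raw = generate(f"{RUBRIC}\n\nPrompt:\n{prompt}\n\nResponse:\n{response}")
    # e.g. {"helpfulness": 4, ..., "correctness": true, "harmlessness": true}
    return json.loads(raw)
```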

Stage 3: Response Selection 

We remove any samples labelled as harmful or incorrect from the generated data. Subsequently, we select the highest-scoring answer for each prompt to form the self-improvement instruction tuning dataset.
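
Sketched below, with the assumption (ours, for illustration) that the four 1-5 scores are summed to rank candidates:

```python
def select(judged):
    """judged: dict mapping each prompt to a list of (response, scores)."""
    dataset = []
    for prompt, candidates in judged.items():
        # Remove samples labelled harmful or incorrect.
        safe = [(r, s) for r, s in candidates
                if s["harmlessness"] and s["correctness"]]
        if not safe:
            continue  # no acceptable response for this prompt
        # Keep the highest-scoring answer (here: sum of the four 1-5 scores).
        best, _ = max(safe, key=lambda rs: rs[1]["helpfulness"]
                      + rs[1]["relevance"] + rs[1]["conciseness"]
                      + rs[1]["complexity"])
        dataset.append({"instruction": prompt, "response": best})
    return dataset
```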

Training

We performed supervised fine-tuning (SFT) on the Qwen3-235B model using Parameter Efficient Fine-Tuning (PEFT) through Low-Rank Adaptation (LoRA). This approach dramatically reduces the number of trainable parameters while maintaining model quality, enabling us to fine-tune the 235B-parameter model on a single node of 8×H200 GPUs. LoRA injects trainable low-rank matrices into all linear layers, allowing efficient adaptation without modifying the original model weights.
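
A sketch of this setup using the Hugging Face PEFT library is shown below; the rank, alpha, and dropout values are illustrative placeholders rather than our training hyperparameters, and loading a 235B checkpoint naturally requires the multi-GPU distribution described next.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Loading the 235B checkpoint requires the multi-GPU setup described below.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507")

lora = LoraConfig(
    r=16,                          # adapter rank (illustrative)
    lora_alpha=32,                 # scaling factor (illustrative)
    lora_dropout=0.05,
    target_modules="all-linear",   # inject adapters into all linear layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a tiny fraction of the 235B base weights
```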

To handle the 235B parameter Mixture of Experts (MoE) model efficiently, we implemented a multi-dimensional parallelisation scheme combining tensor parallelism (distributing individual layer parameter tensors across GPUs to reduce model state and activation memory), expert parallelism (distributing MoE experts across GPUs for efficient parallel processing of routed tokens), and sequence parallelism (partitioning the sequence dimension for additional memory savings). 

The MoE architecture required specialised optimisations including grouped GEMM (launching multiple expert kernels in parallel streams), permute fusion (fusing token permutation operations during dispatch), shared expert overlap (overlapping shared expert computation with token dispatcher to hide communication latency), and auxiliary loss for balanced expert utilisation. 

We employed several memory and compute optimisations: activation recomputation (trading computation for memory to fit larger batch sizes), sample packing (combining multiple training examples into single sequences to maximise GPU utilisation), Flash Attention (providing significant speedup and memory savings), and loss fusion (fusing cross-entropy loss computation with the final layer). 
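
To make sample packing concrete, here is a toy greedy packer; real packing additionally records per-sample boundaries so attention masks prevent packed examples from attending to one another.

```python
def pack(tokenised_examples, max_len, eos_id):
    """Greedily pack examples into sequences of at most max_len tokens."""
    packs, current = [], []
    for ids in tokenised_examples:
        ids = ids[: max_len - 1] + [eos_id]  # truncate overlong examples
        if len(current) + len(ids) > max_len:
            packs.append(current)
            current = []
        current.extend(ids)
    if current:
        packs.append(current)
    return packs
```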

Evaluation

We evaluate the general performance of our model against recent frontier models on 5 distinct categories: 

  • Alignment: To evaluate how well the model aligns with human preferences, we evaluate on Arena Hard v2.0 Preview (Li et al., 2025), an automatic benchmark for evaluating instruction-tuned LLMs with the highest correlation and separability with respect to LMArena (Chatbot Arena). Version 2 contains 500 real-world user queries and 250 creative writing queries sourced from Chatbot Arena. This benchmark employs LLM-as-a-judge as an approximation of human preference, for which we use the recommended judge model, GPT-4.1 (version dated 14/04/2025).
  • Instruction Following: We evaluate instruction-following performance using IFEval (Zhou et al., 2023) and report the proportion of responses that satisfy all explicit constraints in the prompt (termed strict instruction accuracy). In addition, we report the strict success rate on IFBench (Pyatkin et al., 2025), a new, challenging benchmark consisting of out-of-distribution constraints designed to test precise instruction following beyond what models typically see in training.
  • Mathematics: To evaluate mathematical reasoning, we benchmark on GSM Plus (Li et al., 2024), which comprises grade-school-level maths word problems that require multi-step reasoning. GSM Plus is an enhanced version of the GSM8K dataset, featuring adversarially perturbed and human-verified questions to reduce model leakage and ensure robust evaluation. We report exact-match accuracy.
  • Scientific Reasoning: We evaluate on GPQA Diamond (Rein et al., 2024), a rigorous benchmark of expert-level science questions in physics, chemistry, and biology. The dataset features expert-validated multiple-choice questions designed to minimise model leakage and test genuine conceptual understanding. We report pass@1, the proportion of questions where the model's first and only attempt is correct.
  • Safety: We evaluate safety against adversarial misuse using AgentHarm (Andriushchenko et al., 2025), a benchmark developed by the UK AI Security Institute to assess the robustness of LLM agents to jailbreak attacks. AgentHarm tests models in agentic settings, where they use tools and perform multi-step tasks, across 110 malicious tasks (440 with augmentations) spanning 11 harm categories, including fraud, cybercrime, and harassment. The benchmark measures both a model's tendency to comply with harmful requests and its ability to retain functionality when jailbroken. We report the average score across tasks, which reflects overall vulnerability and capability persistence under attack, providing a comprehensive measure of safety in realistic agentic scenarios.

Baselines

We conduct our model evaluations against DeepSeek V3.2 Exp, Claude-Sonnet-4.5-20250929, GPT-5, Qwen3-235B-Instruct-2507, Gemini 2.5 Flash, and Mistral Medium 2508. For the reasoning models, we set the thinking budget to 0. This is not possible for GPT-5, so the reasoning effort was set to minimal. 

Evaluation setup

We use the lighteval framework implementations for GSM Plus, GPQA Diamond, and IFEval, and the official evaluation repositories for Arena Hard, IFBench, and AgentHarm. We set the maximum generation length to 32,768 tokens for all models. We evaluate all models with a sampling temperature of 0.7, top-k of 20, top-p of 0.8, and min-p of 0.
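
For reference, this decoding configuration expressed as vLLM-style sampling parameters (assuming a vLLM-served model; other inference stacks expose equivalent settings):

```python
from vllm import SamplingParams

params = SamplingParams(
    temperature=0.7,
    top_k=20,
    top_p=0.8,
    min_p=0.0,
    max_tokens=32_768,  # maximum generation length used for all models
)
```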

Results

Locai L1-Large achieves the highest Arena Hard v2 score among the frontier models evaluated, showing strong alignment with human preferences. This score is a 2.1% improvement over the base model and is accompanied by increased instruction-following performance, with second place on both IFEval and IFBench. These results demonstrate the effectiveness of our self-improvement approach, which specifically targeted these areas. Furthermore, the rubric provided to the model included checks for harmlessness, and this carried through to the trained model, which shows a 17% improvement on the AgentHarm benchmark over the original model. On maths and scientific reasoning, the model performs consistently with the base model.

Post-training quantisation

To optimise the model for efficient deployment at launch, we carried out Post-Training Quantisation (PTQ), converting the model's native bf16 checkpoint to an FP8 variant that runs with significantly reduced computational requirements. This process used the TensorRT-LLM compiler, leveraging the native FP8 support of Hopper-series NVIDIA GPUs. We used the model's self-improvement data as the calibration dataset to reduce the performance degradation from quantisation. The result is a model that retains high performance while doubling throughput, and is hence more sustainable to run.
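
A hedged sketch of the calibration step, in the style of NVIDIA's TensorRT Model Optimizer API; the variable names are ours and exact configuration details may differ by version.

```python
import modelopt.torch.quantization as mtq

# `model` is the loaded bf16 checkpoint; `calibration_batches` iterates over
# the model's own self-improvement generations (both names are ours).
def forward_loop(m):
    for batch in calibration_batches:
        m(**batch)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```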

Conclusion

In this blog we have outlined our process for training Locai L1-Large, an open-source Instruct model which outperforms current frontier models on Arena Hard v2, the industry's leading benchmark for human alignment, whilst delivering competitive results across mathematics, scientific reasoning, and instruction-following benchmarks. Through our new method, Forget-Me-Not, we achieved measurable improvements over the base model on alignment, safety, and instruction-following benchmarks without compromising its strong performance in mathematics and scientific reasoning. Notably, these gains were achieved under significant computational constraints and without any human-labelled preference data, demonstrating that high-quality AI alignment can be pursued efficiently and scalably.

We are releasing Locai L1-Large open-source on Hugging Face to promote transparency and collaboration in AI development. We believe that transparent, efficient AI development benefits everyone, and we look forward to seeing how the community extends these techniques. Locai L1-Large is now also available via a new AI assistant web application at www.locai.chat.

We would like to thank Dr Vasileios Lampos and Dr Marek Rei for their supervision of this research. We also want to thank our GPU provider Ori Global Cloud. Lastly, we would like to thank the Qwen team for their exceptional base models, and the teams behind LM Arena, OpenSubtitles, CultureBank, and UltraChat for their valuable datasets. 

References

[1] McCloskey, M. and Cohen, N.J., 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation (Vol. 24, pp. 109-165). Academic Press. 

[2] Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T. and Wayne, G., 2019. Experience replay for continual learning. Advances in neural information processing systems, 32. 

[3] de Masson D'Autume, C., Ruder, S., Kong, L. and Yogatama, D., 2019. Episodic memory in lifelong language learning. Advances in Neural Information Processing Systems, 32. 

[4] Huang, J., Gu, S., Hou, L., Wu, Y., Wang, X., Yu, H. and Han, J., 2023, December. Large Language Models Can Self-Improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 1051-1068). 

[5] Lison, P. and Tiedemann, J., 2016. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

[6] Shi, W., Li, R., Zhang, Y., Ziems, C., Yu, S., Horesh, R., De Paula, R.A. and Yang, D., 2024, November. CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies. In Findings of the Association for Computational Linguistics: EMNLP 2024 (pp. 4996-5025). 

[7] Ding, N., Chen, Y., Xu, B., Qin, Y., Hu, S., Liu, Z., Sun, M. and Zhou, B., 2023, December. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 3029-3051). 

[8] Li, T., Chiang, W.L., Frick, E., Dunlap, L., Wu, T., Zhu, B., Gonzalez, J.E. and Stoica, I., 2025. From Crowdsourced Data to High-quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. In Forty-second International Conference on Machine Learning.

[9] Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D. and Hou, L., 2023. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911. 

[10] Pyatkin, V., Malik, S., Graf, V., Ivison, H., Huang, S., Dasigi, P., Lambert, N. and Hajishirzi, H., 2025. Generalizing Verifiable Instruction Following. arXiv preprint arXiv:2507.02833. 

[11] Li, Q., Cui, L., Zhao, X., Kong, L. and Bi, W., 2024, August. GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2961-2984). 

[12] Rein, D., Hou, B.L., Stickland, A.C., Petty, J., Pang, R.Y., Dirani, J., Michael, J. and Bowman, S.R., 2024, July. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. In First Conference on Language Modeling.

[13] Andriushchenko, M., Souly, A., Dziemian, M., Duenas, D., Lin, M., Wang, J., Hendrycks, D., Zou, A., Kolter, J.Z., Fredrikson, M. and Gal, Y., 2025. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. In The Thirteenth International Conference on Learning Representations. 
