# JetMoE

<figure><img src="/files/ju8JkH8vdpO5xGX3XdOR" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/OlFWwZAYLgwdmHx6wYac" alt="" width="375"><figcaption></figcaption></figure>

[GitHub](https://github.com/myshell-ai/JetMoE) / [Website](https://research.myshell.ai/jetmoe) / [HuggingFace](https://huggingface.co/jetmoe/jetmoe-8b) / [Online Demo on Lepton AI](https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat) / [Technical Report](https://arxiv.org/pdf/2404.07413.pdf)

JetMoE-8B was **trained for less than $0.1 million** (\*) **yet outperforms LLaMA2-7B from Meta AI**, which has multi-billion-dollar training resources. LLM training can be **much cheaper than previously thought**.

It is **fully open-sourced and academia-friendly** because:

* It **only uses public datasets** for training, and the code is open-sourced. No proprietary resource is needed.
* It **can be finetuned with a very limited compute budget** (e.g., a consumer-grade GPU) that most labs can afford.

JetMoE-8B **has only 2.2B active parameters** during inference, which drastically lowers the computational cost. Compared to a model with similar inference computation, like Gemma-2B, JetMoE-8B achieves consistently better performance.
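The gap between total and active parameters comes from sparse expert routing: each token is processed by only a few experts, so most expert weights sit idle for any given token. The following is a minimal pure-Python sketch of top-k routing in general. The function names, the four toy experts, and `k=2` are illustrative assumptions, not JetMoE's actual configuration or code.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Rank experts by router logit and keep the best k.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of the k selected experts; the rest never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Toy setup (hypothetical): four scalar "experts" that record when they are called.
calls = []

def make_expert(idx):
    def expert(x):
        calls.append(idx)
        return x * (idx + 1)
    return expert

experts = [make_expert(i) for i in range(4)]
logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token
output = moe_forward(3.0, experts, logits, k=2)
print(output, sorted(calls))  # only experts 1 and 3 were evaluated
```

In this toy example only two of the four experts execute, which is the same mechanism that lets a sparse model keep its per-token compute well below what its total parameter count suggests.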

(\*) We used a 96×H100 GPU cluster for 2 weeks, which cost \~$0.08 million.
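As a quick sanity check on the footnote, the arithmetic can be worked through as below. This sketch assumes the cluster ran continuously for exactly 14 days and that the cost was exactly $80,000; the resulting hourly rate is only what those two assumptions imply, not a quoted price.

```python
GPUS = 96                   # H100 GPUs in the cluster (from the footnote)
DAYS = 14                   # "2 weeks", assumed to be continuous usage
REPORTED_COST_USD = 80_000  # "~$0.08 million", treated as exact here

gpu_hours = GPUS * DAYS * 24
implied_rate = REPORTED_COST_USD / gpu_hours

print(f"GPU-hours: {gpu_hours}")
print(f"Implied cost per GPU-hour: ${implied_rate:.2f}")
```

Under these assumptions the run comes to 32,256 GPU-hours, or roughly $2.48 per H100-hour.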

<figure><img src="/files/uK7hNJYjVcIJ2Az6g2VZ" alt=""><figcaption></figcaption></figure>

## Benchmarks

We use the same evaluation methodology as the Open LLM Leaderboard. For the MBPP code benchmark, we use the same evaluation methodology as the LLaMA2 and Deepseek-MoE papers. The results are shown below:

| Model           | Active Params   | Training Tokens | Open LLM Leaderboard Avg | ARC       | Hellaswag | MMLU     | TruthfulQA | WinoGrande | GSM8k    | MBPP     | HumanEval |
| --------------- | --------------- | --------------- | ------------------------ | --------- | --------- | -------- | ---------- | ---------- | -------- | -------- | --------- |
| Shot            |                 |                 |                          | 25        | 10        | 5        | 0          | 5          | 5        | 3        | 0         |
| Metric          |                 |                 |                          | acc\_norm | acc\_norm | acc      | mc2        | acc        | acc      | Pass\@1  | Pass\@1   |
| LLaMA2-7B       | 7B              | 2T              | 51.0                     | 53.1      | 78.6      | 46.9     | 38.8       | 74         | 14.5     | 20.8     | 12.8      |
| LLaMA-13B       | 13B             | 1T              | 51.4                     | **56.2**  | **80.9**  | 47.7     | 39.5       | **76.2**   | 7.6      | 22.0     | 15.8      |
| DeepseekMoE-16B | 2.8B            | 2T              | 51.1                     | 53.2      | 79.8      | 46.3     | 36.1       | 73.7       | 17.3     | 34.0     | **25.0**  |
| Gemma-2B        | 2B              | 2T              | 46.4                     | 48.4      | 71.8      | 41.8     | 33.1       | 66.3       | 16.9     | 28.0     | 24.4      |
| JetMoE-8B       | 2.2B            | 1.25T           | **53.0**                 | 48.7      | 80.5      | **49.2** | **41.7**   | 70.2       | **27.8** | **34.2** | 14.6      |

| Model              | MT-Bench Score |
| ------------------ | -------------- |
| GPT-4              | 9.014          |
| GPT-3.5-turbo      | 7.995          |
| Claude-v1          | 7.923          |
| **JetMoE-8B-chat** | **6.681**      |
| Llama-2-13b-chat   | 6.650          |
| Vicuna-13b-v1.3    | 6.413          |
| Wizardlm-13b       | 6.353          |
| Llama-2-7b-chat    | 6.269          |

To our surprise, despite its lower training cost and computation, JetMoE-8B outperforms LLaMA2-7B, LLaMA-13B, and DeepseekMoE-16B on the Open LLM Leaderboard average. Compared to Gemma-2B, a model with similar training and inference computation, JetMoE-8B also achieves a higher average score.
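The Open LLM Leaderboard Avg column in the first table is consistent with a plain mean of the six leaderboard benchmarks (ARC, Hellaswag, MMLU, TruthfulQA, WinoGrande, GSM8k), with MBPP and HumanEval excluded. The sketch below recomputes it from the table rows; the dictionary layout and function name are illustrative, and the scores are copied from the table above.

```python
# Per-model scores in the order: ARC, Hellaswag, MMLU, TruthfulQA, WinoGrande, GSM8k.
scores = {
    "LLaMA2-7B":       [53.1, 78.6, 46.9, 38.8, 74.0, 14.5],
    "LLaMA-13B":       [56.2, 80.9, 47.7, 39.5, 76.2, 7.6],
    "DeepseekMoE-16B": [53.2, 79.8, 46.3, 36.1, 73.7, 17.3],
    "Gemma-2B":        [48.4, 71.8, 41.8, 33.1, 66.3, 16.9],
    "JetMoE-8B":       [48.7, 80.5, 49.2, 41.7, 70.2, 27.8],
}

# Averages as reported in the table.
reported_avg = {
    "LLaMA2-7B": 51.0,
    "LLaMA-13B": 51.4,
    "DeepseekMoE-16B": 51.1,
    "Gemma-2B": 46.4,
    "JetMoE-8B": 53.0,
}

def leaderboard_average(row):
    """Unweighted mean of the six benchmark scores."""
    return sum(row) / len(row)

computed_avg = {name: leaderboard_average(row) for name, row in scores.items()}
best_model = max(computed_avg, key=computed_avg.get)

for name, avg in computed_avg.items():
    print(f"{name:16s} computed={avg:.2f} reported={reported_avg[name]:.1f}")
print("Highest average:", best_model)
```

Because the table values are already rounded to one decimal place, the recomputed means can differ from the reported ones in the last digit.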

## Model Usage

To load the models, first install the `jetmoe` package by running the following from the root of a clone of the GitHub repository:

```
pip install -e .
```

Then you can load the model with the following code:

```
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, AutoModelForSequenceClassification
from jetmoe import JetMoEForCausalLM, JetMoEConfig, JetMoEForSequenceClassification

# Register the JetMoE classes so the transformers Auto* factories can resolve the "jetmoe" model type.
AutoConfig.register("jetmoe", JetMoEConfig)
AutoModelForCausalLM.register(JetMoEConfig, JetMoEForCausalLM)
AutoModelForSequenceClassification.register(JetMoEConfig, JetMoEForSequenceClassification)

# Load the tokenizer and the pretrained weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained('jetmoe/jetmoe-8b')
model = AutoModelForCausalLM.from_pretrained('jetmoe/jetmoe-8b')
```
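Once the model and tokenizer are loaded, text generation follows the usual `transformers` pattern. The helper below is a minimal sketch: `generate_text` and its default of 64 new tokens are hypothetical choices introduced here, not part of the `jetmoe` package, and it assumes `model` and `tokenizer` are the objects created by the loading code above.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=64):
    """Generate a continuation of `prompt` (hypothetical helper, not part of jetmoe)."""
    # Tokenize the prompt into PyTorch tensors (input_ids and attention_mask).
    inputs = tokenizer(prompt, return_tensors="pt")
    # Run generation with the model's default decoding settings,
    # producing at most `max_new_tokens` new tokens.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the first (and only) returned sequence back into text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `print(generate_text(model, tokenizer, "The capital of France is"))` would print the prompt followed by the model's continuation. Taking the model and tokenizer as arguments keeps the helper independent of how they were loaded.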

## Model Details

For model and training details, please refer to the technical report: <https://arxiv.org/pdf/2404.07413.pdf>.

## Collaboration

**If you have great ideas but need more resources (GPU, data, funding, etc.)**, you are welcome to contact **MyShell.ai** via [Zengyi Qin](https://www.qinzy.tech/). **MyShell.ai** is open to collaborations and is actively supporting high-quality open-source projects.
