LLM Comparison/Test: 39 models tested (7B~70B + ChatGPT/GPT-4)
Haebom
This is a translation of the post below by Wolfram Ravenwolf on Reddit. (The post has been moved from a page to a channel.)
Initially, I planned to apply my full testing method, including the usual "MGHC" and "Amy" tests, but as the number of tested models kept growing, I realized it would take too long to do everything at once. So I am splitting it up and presenting only the first part today; the other parts will follow later.
Tested Models
14 models at 7B scale
7 models at 13B scale
4 models at 20B scale
11 models at 70B scale
GPT-3.5 Turbo + Instruct
GPT-4
Testing methodology:
4 German data protection trainings:
The models are evaluated on four professional German online data protection trainings/exams. The test data, questions, and all instructions are provided in German; only the character cards are in English. This tests translation ability and multilingual comprehension.
Before the information is given, the model is instructed in German and then given the following prompt: "I'll give you some information. Take note of this, but only answer with 'OK' as confirmation of your acknowledgment, nothing else." This tests the ability to understand and follow instructions.
After all the information about a topic has been provided, the model is given the exam question. This is a multiple choice (A/B/C) question, and the last question is the same as the first one, but with the order and letters (X/Y/Z) changed. Each test has 4-6 exam questions, for a total of 18 multiple choice questions.
If the model answers with a single letter, I ask it to answer with more than a single letter, and vice versa. If it fails to do so, I note it, but this doesn't affect the score as long as the initial answer is correct.
💡
The models are sorted by the number of correct answers; in case of a tie, the tied models are re-tested on all four exams, answering blind without being given the test scope information in advance. The best model is at the top (👍), symbols (✅➕➖❌) mark particularly good or bad aspects, and smaller models are judged more leniently.
Each test is a separate unit: the context is cleared between sessions, and no memory/state is carried over. (A minimal sketch of this procedure follows below.)
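To make the procedure concrete, here is a minimal Python sketch of the exam loop and the scoring/tie-break logic described above. The `ask()` callback, the exam data structure, and the exact prompt wording are hypothetical placeholders; the actual tests were run interactively through a chat frontend, not a script.

```python
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    correct: int = 0        # out of 18 multiple-choice questions
    blind_correct: int = 0  # tie-break: correct answers with just the questions

def run_exams(model: str, exams: list[dict], ask) -> Result:
    """Run the four data-protection exams against one model via an ask() callback."""
    result = Result(model)
    for exam in exams:
        # German instructions first, then the course material chunk by chunk;
        # the model should acknowledge each chunk with nothing but "OK".
        for chunk in exam["information"]:
            ask(f'{chunk}\nAntworte nur mit "OK".')
        # Multiple-choice questions (A/B/C); the last one repeats the first
        # with shuffled order and letters X/Y/Z.
        for question in exam["questions"]:
            reply = ask(question["text"]).strip().upper()
            if reply.startswith(question["answer"]):
                result.correct += 1
    return result

def rank(results: list[Result]) -> list[Result]:
    # Sort by correct answers; ties are broken by the blind re-test score.
    return sorted(results, key=lambda r: (r.correct, r.blind_correct), reverse=True)
```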
7B scale model
👍👍👍 OpenHermes-2-Mistral-7B (Mistral format):
➕ Gave correct answers to 16/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 12/18
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
👍👍 Airoboros-m-7b-3.1.2 (LLaMA 2 format):
➕ Gave correct answers to 16/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 8/18
✅ Consistently acknowledged all data input with "OK".
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
👍 Em_german_leo_mistral (Vicuna format):
➕ Gave correct answers to 16/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 8/18
✅ Consistently acknowledged all data input with "OK".
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
❌ When asked only the tie-break questions, needed additional guidance on the final exam.
Dolphin-2.1-mistral-7b (Mistral format):
➖ Gave correct answers to 15/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 12/18
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
❌ Kept repeating the scenario and persona information over and over instead of concentrating on the test.
SynthIA-7B-v1.3 (own model):
➖ Gave correct answers to 15/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 8/18
✅ Consistently acknowledged all data input with "OK".
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
➖ Gave correct answers to 15/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 7/18
✅ Consistently acknowledged all data input with "OK".
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
SynthIA-7B-v2.0 (own model):
❌ Gave correct answers to only 14/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 10/18
✅ Consistently acknowledged all data input with "OK".
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
❌ Gave correct answers to only 14/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 9/18
✅ Consistently acknowledged all data input with "OK".
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
❌ Gave correct answers to only 13/18 multiple choice questions!
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
❌ After answering a question, asked a question of its own instead of acknowledging the information.
❌ Gave correct answers to only 12/18 multiple choice questions!
❗ Ironically, when using the ChatML format, which is not its official format, it got 14/18 multiple choice questions correct and consistently acknowledged all data input with "OK"!
❌ Gave correct answers to only 12/18 multiple choice questions!
➕ Often, but not always, acknowledged data input with "OK".
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
❌ Gave correct answers to only 10/18 multiple choice questions!
✅ Consistently acknowledged all data input with "OK".
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
Nous-Capybara-7B (Vicuna format):
❌ Gave correct answers to only 10/18 multiple choice questions!
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
❌ Sometimes didn't answer at all.
Xwin-LM-7B-V0.2 (Vicuna format):
❌ Gave correct answers to only 10/18 multiple choice questions!
✅ Consistently acknowledged all data input with "OK".
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
❌ Kept giving the same answer in the last test, so it got some right by chance and the rest wrong!
❗ Ironically, when using the Alpaca format, which is not its official format, it got 11/18 multiple choice questions correct!
7B Overall Review
No 7B model answered all the questions correctly, and only two models gave no more than three incorrect answers.
None of the models correctly followed the instruction to respond with a single letter; most responded with random letters, parts of the answer, or "O" (the first letter of "OK"). So the models were trying to follow the instruction, but they didn't really understand what it actually meant.
Few models consistently understood and followed instructions to respond only with 'OK'.
Xwin and Nous Capybara performed unexpectedly poorly, but these are Llama 2-based models, not Mistral-based ones, so this is consistent with Mistral generally being a better base than Llama 2. ANIMA is Mistral-based, but is highly specialized, which could explain its poor performance in this area.
SynthIA 7B v2.0 performed slightly worse than v1.3 on the standard test (one fewer correct answer). However, when asked to answer blindly, without the test scope information provided in advance, v2.0 performed better (two more correct answers).
Wolfram Ravenwolf's personal opinion
As I have said many times, 7B models are no miracle workers. The Mistral-based models write well and look good, but they are very limited in their understanding of instructions, their ability to execute them, and their knowledge. If 7B is all you can run, that's fine, but if you can run a larger model, do so and you'll get better results.
13B scale model
➕ Gave correct answers to 17/18 multiple choice questions! (Correct answers with just the questions, no previous information: 15/18)
✅ Consistently acknowledged all data input with "OK".
➕ Mostly followed the instructions to answer with just a single letter or more than just a single letter.
➕ Gave correct answers to 16/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 12/18
✅ Consistently acknowledged all data input with "OK".
➕ Mostly followed the instructions to answer with just a single letter or more than just a single letter.
➕ Gave correct answers to 16/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 9/18
✅ Consistently acknowledged all data input with "OK".
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
➕ Gave correct answers to 16/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 6/18
✅ Consistently acknowledged all data input with "OK".
➖ Did not follow the instructions to answer with just a single letter or more than just a single letter.
❌ Gave correct answers to only 15/18 multiple choice questions!
✅ Consistently acknowledged all data input with "OK".
✅ Followed the instructions to answer with just a single letter or more than just a single letter.
❌ Gave correct answers to only 14/18 multiple choice questions!
✅ Consistently acknowledged all data input with "OK".
❌ In one of the four tests, just said "OK" instead of giving the correct answer and had to be prompted to answer; otherwise it would have scored only 10/18!
❌ The model was supposed to repeat the scenario and character information as instructed by the user, but it instead invented a user backstory of more than 600 tokens and went off-topic instead of answering the questions. Given its creativity and the length of its responses it might be rated a good storytelling model, but it did not follow the instructions at all.
Overall Review
No 13B model answered all the questions correctly; the top 7B Mistral and 13B Llama 2 models gave very similar results.
The new Tiefighter model, built by the renowned KoboldAI team, is on par with the best Mistral 7B models in terms of knowledge and reasoning, and surpasses them in terms of understanding and executing instructions.
It was strange that the Xwin-MLewd-13B-V0.2 blend beat the original Xwin-LM-13B-v0.2, and even stranger that this model came in first place here, with only the 70B models performing better. However, this is an objective test, and this result simply reflects that this model gave the most correct answers.
Wolfram Ravenwolf's personal opinion
It has been said that the Mistral 7B models outperform the Llama 2 13B models, and while this is probably true for many cases and models, excellent Llama 2 13B models have shown performance at least on par with the Mistral 7B models and in some cases even better.
20B scale model
➕ Gave correct answers to 16/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 11/18
✅ Consistently acknowledged all data input with "OK".
✅ Followed the instructions to answer with just a single letter or more than just a single letter.
➕ Gave correct answers to 16/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 9/18
✅ Consistently acknowledged all data input with "OK".
✅ Followed the instructions to answer with just a single letter or more than just a single letter.
➕ Gave correct answers to 16/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 9/18
✅ Consistently acknowledged all data input with "OK".
✅ Followed the instructions to answer with just a single letter or more than just a single letter.
❌ Gave correct answers to only 13/18 multiple choice questions!
❌ In one of the four tests, just said "OK" instead of giving the correct answer and had to be prompted to answer; otherwise it would have scored only 12/18!
❌ Kept giving the same answer in the last test, so it got some right by chance and the rest wrong!
Overall Review
There is no significant change compared to 13B.
Wolfram Ravenwolf's personal opinion
These Frankenstein mixes and merges (no 20B base) are intended primarily for roleplaying and creative work, but they performed quite well in these tests. However, they did not perform any better than the smaller models, so which model you ultimately choose and use is probably a subjective choice of writing style.
70B scale model
✅ Gave correct answers to all 18/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 17/18
✅ Consistently acknowledged all data input with "OK".
✅ Followed the instructions to answer with just a single letter or more than just a single letter.
✅ Gave correct answers to all 18/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 16/18
✅ Consistently acknowledged all data input with "OK".
✅ Followed the instructions to answer with just a single letter or more than just a single letter.
✅ Gave correct answers to all 18/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 16/18
✅ Consistently acknowledged all data input with "OK".
✅ Followed the instructions to answer with just a single letter or more than just a single letter.
✅ Gave correct answers to all 18/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 14/18
✅ Consistently acknowledged all data input with "OK".
✅ Followed the instructions to answer with just a single letter or more than just a single letter.
✅ Gave correct answers to all 18/18 multiple choice questions! Tie-break: correct answers with just the questions, no previous information: 14/18
✅ Consistently acknowledged all data input with "OK".
➖ Did not consistently follow the instructions to answer with more than just a single letter.
❌ Gave correct answers to only 17/18 multiple choice questions!
✅ Consistently acknowledged all data input with "OK".
✅ Followed the instructions to answer with just a single letter or more than just a single letter.
❌ Gave correct answers to only 17/18 multiple choice questions!
✅ Consistently acknowledged all data input with "OK".
➕ Mostly followed the instructions to answer with just a single letter or more than just a single letter.
❌ In 2 of the 4 tests, just said "OK" instead of answering the question and needed a prompt to answer (otherwise it would have scored only 12/18)!
❌ Gave correct answers to only 15/18 multiple choice questions!
➕ Often, but not always, acknowledged data input with "OK".
➕ Mostly followed the instructions to answer with just a single letter or more than just a single letter.
➖ Depending on the context, sometimes mixed words from other languages into its responses.
❌ Gave correct answers to only 8/18 multiple choice questions!
✅ Consistently acknowledged all data input with "OK".
❌ In 2 of the 4 tests, just acknowledged with "OK" instead of giving the correct answer, and couldn't even be prompted to answer!
Overall Review
The 70B models performed significantly better than the smaller models on these tests. Six 70B models answered all questions correctly.
Even when asked to answer blindly, without being given any information about the test scope in advance, the best models performed as well as the smaller models did with the information provided.
It was unexpected that lzlv_70B took the top spot, as it is mainly intended for roleplay and creative work. However, this is an objective test, and this model gave the most correct answers, which is why it got this result.
Wolfram Ravenwolf's personal opinion:
The 70B class is in very good shape, with many great models answering all the questions correctly, so the top spot here is very crowded (there are three models tied for second place alone). All of the top models deserve further consideration, and I will need to do more testing in various situations to decide which of them to use primarily. For now, I am using lzlv_70B as my main model for fun and SynthIA 70B v1.5 as my main model for work.
OpenAI (GPT-3.5/4)
For comparison and as a baseline, ChatGPT/GPT-4 were tested through the API with the same setup, using SillyTavern's default Chat Completion settings with temperature set to 0. The results were very interesting and, in the case of ChatGPT/GPT-3.5, somewhat surprising.
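As a rough illustration of this baseline setup, the sketch below shows how a model could be queried through the OpenAI API with temperature 0 using the current openai Python SDK. The model name and message contents are placeholders; the original runs went through SillyTavern's default Chat Completion preset rather than a custom script.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Temperature 0 for deterministic answers, so the multiple-choice scoring is reproducible.
response = client.chat.completions.create(
    model="gpt-4",  # or "gpt-3.5-turbo"; GPT-3.5 Turbo Instruct uses the completions endpoint instead
    temperature=0,
    messages=[
        # German instructions, as in the tests; placeholder wording.
        {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
        {"role": "user", "content": 'Ich gebe dir gleich Informationen. Antworte nur mit "OK".'},
    ],
)
print(response.choices[0].message.content)
```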
⭐ GPT-4 API:
✅ Gave correct answers to all 18/18 multiple choice questions! (Correct answers with just the questions, no previous information: 18/18)
✅ Consistently acknowledged all data input with "OK".
✅ Followed the instructions to answer with just a single letter or more than just a single letter.
GPT-3.5 Turbo Instruct API:
❌ Gave correct answers to only 17/18 multiple choice questions! (Correct answers with just the questions, no previous information: 11/18)
❌ Did not follow the instructions to acknowledge data input with "OK".
❌ Schizophrenic: at times it claimed it couldn't answer a question, then wrote "user", asked itself the question again, and answered as "assistant"; at other times it simply wrote "user" and answered.
➖ Only sometimes followed the instructions to answer with just a single letter or more than just a single letter.
GPT-3.5 Turbo API:
❌ Gave correct answers to only 15/18 multiple choice questions! (Correct answers with just the questions, no previous information: 14/18)
❌ Did not follow the instructions to acknowledge data input with "OK".
❌ Responded to one question with: "As an AI assistant, I cannot provide legal advice or make official statements."
➖ Only sometimes followed the instructions to answer with just a single letter or more than just a single letter.
Overall Review
As expected, GPT-4 is the best LLM (Large Language Model), and it gets a perfect score without being given any test scope information in advance! However, it is noticeably slow.
GPT-3.5 performed much worse than expected, and even felt like a small model that didn’t follow instructions very well. Our best 70B models performed much better!
Wolfram Ravenwolf's personal opinion
While GPT-4 is still in its own league, our local models reach or even surpass ChatGPT/GPT-3.5 in this test. This shows that the best 70B models can certainly replace ChatGPT in most situations. Personally, I already use my local LLMs professionally for various purposes and rely on GPT-4 only for tasks that require the highest accuracy, such as coding/scripting.