Tried Phi-4, It didn’t Impress

Phi-4 14B was recently released. Benchmarks look promising; it reportedly beats GPT-4o in math, for example.

I tested an 8-bit quantized version using my LLM Chess eval, which tests both chess proficiency and instruction following.

The model scored 0 wins and 0 draws in 30 games against a random player, with games lasting 7.7 moves on average before the model went off the rails and broke the prompt instructions. For comparison, Gemma 2 9B scored 4 draws in 30 games, with games lasting 69 moves on average.

I was mostly interested not in chess proficiency (it seems all chat models are bad here) but in instruction-following consistency. Lately I have observed a tendency in newer models to struggle to adhere to prompt instructions while padding their responses with verbosity (e.g., Nemotron 70B). Overall, small models that shine across evals have not impressed me when used in real life.

And that is exactly the outcome I got when testing Phi-4: lots of words in the replies and poor instruction following:

| Model | Mistakes ▼ | Tokens |
| --- | --- | --- |
| phi-4 | 387.94 | 333.54 |
| gemma-2-9b-it-8bit | 36.63 | 58.12 |

Compared to the smaller (and older) Gemma 9B, Phi-4 spilled out almost 6 times more tokens while deciding on its next move and made more than 10 times as many mistakes.

Here’s the prompt used to instruct the model on making a move:

You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of the following actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'make_move <UCI formatted move>' when you are ready to complete your turn (e.g., 'make_move e2e4')

Every time an LLM response can't be matched to a specific action (via regex), or when the requested move is not valid, the response is registered as a mistake. If an LLM fails 3 times in a single dialog, the game is terminated and the LLM is given a loss. What could be easier than picking one of 3 actions (and one of the legal moves from a list)? Yet, as my tests show, models struggle even with these very basic rules.
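
For illustration, here is a minimal sketch of what such matching could look like in Python. This is my guess at the mechanics, not the eval's actual code: the regex, the `parse_action` function, and the sample moves are all hypothetical.

```python
import re

# Hypothetical pattern: the reply must be exactly one of the three actions,
# with make_move followed by a UCI move (e.g., e7e5, optional promotion piece).
ACTION_RE = re.compile(
    r"^(get_current_board|get_legal_moves|make_move\s+([a-h][1-8][a-h][1-8][qrbn]?))$"
)


def parse_action(reply: str, legal_moves: set[str]) -> str | None:
    """Return the recognized action, or None if the reply counts as a mistake."""
    match = ACTION_RE.match(reply.strip())
    if match is None:
        return None  # not one of the three allowed actions -> mistake
    move = match.group(2)  # the UCI move, present only for make_move
    if move is not None and move not in legal_moves:
        return None  # well-formed action, but an illegal move -> mistake
    return match.group(1)


legal = {"e7e5", "g8f6"}
print(parse_action("make_move e7e5", legal))                            # make_move e7e5
print(parse_action("Certainly! Let's go with make_move e7e5.", legal))  # None
```

The second call returns None because the reply is not an exact action string, even though a valid action is buried inside it; that is precisely the failure mode a verbose model runs into.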

Here’s the full leaderboard.

P.S.

I suspect that many AI shops have been baking CoT (chain of thought) into model post-training, making models more verbose as a form of the test-time-compute strategy that has been talked about lately. The downside is that the models flood their replies with excessive tokens and get lost.


