Getting Recommendations from ChatGPT for Next Fest Demos

| Name | Round 1 | Round 2 | Round 3 |
|---|---|---|---|
| Object Impermanence | 1. (4.5-5.5) | 4. (4.3-4.8) | 5. (4.3-4.8) |
| Database Detective: Minor Crimes Division | 2. (4.0-5.5) | 1. (4.8-5.5) | 1. (4.8-5.6) |
| Chicken Scratch | 3. (???) | 2. (4.5-5.2) | - |
| The Sorting Bureau | 4. (3.5-4.8) | 5. (3.8-4.6) | 6. (3.5-4.6) |
| What the Dog Remembers | 5. (1.5-5.0) [1] | 7. (2.0-5.0) | - |
| ReStory: Chill Electronics Repairs | 6. (???) [2] | 6. (3.5-4.5) | 4. (3.8-5.0) |
| Michi-Suji Puzzle: Michi Connect | 7. (???) | 3. (4.3-5.1) | 3. (4.3-5.1) |
| Onimusha | - | 8. (3.5-4.3) | 7. (3.0-4.3) |
| Realm of Taiwu | - | 9. (3.0-4.8) [3] | - |

[1] wild card ranging from 1.5 "What was that?" to 5.0 "This was unexpectedly brilliant"
[2] 2.5-3.5 if mostly repair-themed busywork; 4.0+ if it's diagnosis and reasoning
[3] depending on demo pacing

I wanted help choosing the next few demos, so I gave ChatGPT a list of all the demos I had played along with my ratings for them. I had also mentioned that I had written snippets on each game, so it asked for those snippets for a handful of games. It then asked what I was looking to play next, so I gave it a list of unplayed demos I had already downloaded. From that list, it recommended seven games (see "Round 1" in the table above).

As a result, I went ahead and played Object Impermanence. I ended up giving it a 4.6, which lands within ChatGPT's predicted range of 4.5-5.5, though very close to the lower end. In any case, I supplied my first impression notes on Object Impermanence to ChatGPT, and it made an interesting observation:
So your top tier isn't merely "clever." Your top tier is:
- clever
- immediately engaging
- repeatedly surprising
Object Impermanence appears to have achieved the first but not the latter two.
For context, my top tier at this point was the set of demos rated 4.4 or higher: Demon Bluff (5.8), Club Soko (5.0), Character Limit (4.4), Func Key (5.0), and Locktale (4.8). In hindsight, I probably should have grouped Character Limit with the games rated 4.0 and 4.2.

In any case, it went on to update its rankings, becoming less bullish on ReStory, The Sorting Bureau, and Miniature Land, and more bullish on Database Detective, Chicken Scratch, and Michi-Suji. What's interesting about this part of the response is that Miniature Land never appeared in its Round 1 top 7 list, which I would flag as an innocuous hallucination.

After playing Database Detective, I rated it a 5.4 (without knowing what ChatGPT had predicted). Interestingly, this score fell within its predicted range. I was also surprised to notice that its Round 2 prediction was tighter than its Round 1 prediction. As an aside, it occurs to me that ChatGPT did not catch on that I only rate in multiples of 0.2. That reminds me that LLMs aren't that great with numbers, so all these ranges should probably be taken with a grain of salt.

This time, submitting my first impressions didn't prompt any changes to its ratings or suggestions. Instead, it made the following remark:
The most surprising thing so far is that my original model was:
- User likes puzzles.
My current model is closer to:
- User likes being intellectually engaged in a way that produces surprise, preferably through deduction, inference, or knowledge application, and appreciates when a game has enough personality to make that process delightful.

It occurred to me that the next suggestion, Chicken Scratch, was a party game. I informed ChatGPT and got updated recommendations (see Round 3). This didn't change the ranking of Michi-Suji, so I went on to try it. I ended up giving it a 3.8, which falls below ChatGPT's predicted range. Afterwards, I played The Sorting Bureau. Unfortunately, I quickly became bored with the game and gave it a 3.0, which also fell below ChatGPT's predicted range.
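
For a quick tally of how the predictions held up, here is a minimal Python sketch that checks each of my ratings against the predicted range. The ratings and ranges are copied from the post and table above. Which round's range applies to each game is my assumption (the round that was current when I played it), and the `verdict` helper is just an illustrative name.

```python
# Each entry: (game, my rating, predicted range).
# Assumption: the range used is from the round that was current
# when the demo was played (Round 1, 2, 3, 3 respectively).
results = [
    ("Object Impermanence", 4.6, (4.5, 5.5)),
    ("Database Detective", 5.4, (4.8, 5.5)),
    ("Michi-Suji", 3.8, (4.3, 5.1)),
    ("The Sorting Bureau", 3.0, (3.5, 4.6)),
]


def verdict(rating, low, high):
    """Say whether a rating is below, within, or above a predicted range."""
    if rating < low:
        return "below"
    if rating > high:
        return "above"
    return "within"


summary = {name: verdict(rating, *rng) for name, rating, rng in results}

for name, outcome in summary.items():
    print(f"{name}: {outcome}")
```

That works out to two ratings within range and two below, which is too small a sample to say much about how well the ranges are calibrated.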
