Effect of prompt phrasing on model performance and output quality

I’ve been thinking a lot recently about how different styles of phrasing prompts can result in such drastically different outputs. While humans have all kinds of surrounding context cues, LLMs have to infer as much as possible from their single context. Some things which come to mind are:

  • Previous experiences outside of direct memory, e.g. things a person has learned about you through others.
  • Facial expressions
  • Role/relationship to the other party.

Memory is definitely a component here, but things like body language and tone seem to carry a lot of semantic meaning that isn’t easily conveyed via text or even voice. (These semantics can’t be conveyed by tone or expression alone, but rather by a combination of human cues at our “inference time”.)

What do you guys think about this? Anyone doing interesting work here?

p.s. this post is mostly for testing psyche forums. sorry mods if this is the wrong place for it

I can see that.

Though this is only anecdotal, voice-to-voice models seem to have less variance in behavior/personality under vocal prompting than purely text-modal models do.

I don’t recall any studies on this in particular, but perhaps a research direction would be contrasting voice-to-voice models with text-to-text models. Or even just studying voice-to-voice “prompting” techniques, varying inflection, the kind of person speaking, etc., and measuring the results.

There are some voice-to-voice papers on jailbreaking, but that is about the limit of my knowledge of this kind of research.

I know that in general, adding multi-modality doesn’t necessarily improve text performance (and some people argue that for smaller local models, multi-modality just wastes parameters). So even if there is some rich data inside other modalities, there hasn’t been a way to leverage it efficiently. At least this appears to be so; maybe there are papers which show otherwise.