XAI and LLMs are often tools for accomplishing some other goal.
Very limited work has explored the utility of LLMs in use-case–specified user studies, but a user study of Microsoft/GitHub's Copilot [1], an LLM-based code-generation tool, found that it "did not necessarily improve the task completion time or success rate" [52].
LLM outputs often sound highly confident, even when their content is hallucinated [50].
When the user questions an incorrect output, LLMs also have a documented tendency to argue that the user is wrong and that the response is correct; indeed, some have called LLMs "mansplaining as a service" [34].
This can make it more difficult for humans to implement cognitive checks on LLM outputs.
While some recent work has outlined categories of LLM failure modes based on the types of cognitive biases involved [29], we push for greater work in this field.