When the AI is wrong: timing effects of simulated AI assistant error on trustworthiness and dependence
Fabio Mesters
Recent advances in AI capabilities are being harnessed to improve decision support systems in many areas of life and work. However, these AI systems are imperfect, and there is no unified account of how their errors affect human performance and trust in the system. In this project we examined the effect of error timing in a simulated chemical plant with a simulated AI assistant. Based on previous research, we hypothesized that an error made by the AI early in the interaction would lead to a stronger decline in perceived trustworthiness than a late error. We further hypothesized that an early AI error would reduce dependence on the AI's assessment more than a late error. Using a between-subjects design, we randomly assigned participants to either the early or the late error condition. The task was to estimate the percentage of orange pixels in an image, which represented a reactive substance present in containers. Participants first gave an initial estimate, after which the AI took some time to analyse the sample. The AI then presented its recommendation, and participants could enter their final estimate. Notably, only 10% of trials contained a faulty AI recommendation. Based on an a priori power analysis, 230 participants were recruited via the online platform Prolific; after applying exclusion criteria, an effective sample of N = 197 remained. Perceived trustworthiness was measured with the Trust in Automation (TiA) scale, and dependence behaviour was measured with a Weight on Advice index, which quantifies how strongly participants adhered to the AI recommendation. Two-sample t-tests were conducted to compare TiA scores and Weight on Advice between the two conditions. Contrary to our first hypothesis, perceived trustworthiness was significantly higher in the early error condition than in the late error condition. However, there was no significant difference in dependence behaviour between the two conditions.
Future work should investigate why dependence behaviour was not influenced by error timing in this paradigm, and whether the errors themselves were salient enough to be noticed by most participants. Future research should also attempt to replicate our results using entirely different experimental paradigms.
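The abstract does not state how the Weight on Advice index was computed. As an illustration only, the sketch below uses the definition common in the advice-taking literature, (final - initial) / (advice - initial), which may differ from the study's exact operationalisation. The function name, the `clamp` option, and the example values are hypothetical.

```python
from typing import Optional


def weight_on_advice(initial: float, advice: float, final: float,
                     clamp: bool = False) -> Optional[float]:
    """Weight on Advice under the common definition (an assumption here).

    0 means the participant kept the initial estimate,
    1 means the participant fully adopted the AI recommendation.
    """
    if advice == initial:
        # The index is undefined when the advice equals the initial estimate.
        return None
    woa = (final - initial) / (advice - initial)
    if clamp:
        # Optional convention (assumed, not from the abstract):
        # restrict values to the interval [0, 1].
        woa = max(0.0, min(1.0, woa))
    return woa


# Hypothetical trial: initial estimate 30%, AI recommends 50%, final estimate 45%.
print(weight_on_advice(30.0, 50.0, 45.0))
```

Under this definition, a participant who moves three quarters of the way from the initial estimate towards the recommendation receives a value of 0.75. Trials where the index is undefined would need to be handled separately, for example by excluding them.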