Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients via Eliciting and Adhering to Principles

Abstract

Recent work leverages LLMs to roleplay realistic social scenarios, helping novices practice their social skills. However, simulating sensitive interactions, such as those in mental health, is challenging: privacy concerns restrict data access, and collecting expert feedback, although vital, is laborious. To address this, we develop Roleplay-doh, a novel human-LLM collaboration pipeline that elicits qualitative feedback from a domain expert and transforms it into a set of principles, or natural language rules, that govern an LLM-prompted roleplay. We apply this pipeline to enable senior mental health supporters to create customized AI patients that serve as simulated practice partners for novice counselors. After uncovering cases where GPT-4 simulations do not adhere to expert-defined principles, we also introduce a novel principle-adherence prompting pipeline, which shows a 30% improvement in response quality and principle following for the downstream task. Via a user study with 25 counseling experts, we demonstrate that the pipeline makes it easy and effective to create AI patients that more faithfully resemble real patients, as judged by creators and third-party counselors.

Eliciting Expert-Defined Principles for LLM-Simulations

Roleplay-doh is a tool that supports human-LLM collaboration for domain experts refining LLM simulations. Applied to the domain of mental health, the tool enables an expert counselor to create a customized AI patient that novice counselors can use as a practice partner. While interacting with the AI patient, the expert counselor can provide qualitative feedback, which an LLM converts into a principle: a custom rule governing the desired roleplay behavior. The AI patient then references the updated set of expert-defined principles when generating its subsequent responses.
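The feedback-to-principle loop described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the `AIPatient` class, its method names, the prompt wording, and the `LLM` callable (any function that maps a prompt string to a completion string) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interface: any function that takes a prompt and returns text.
LLM = Callable[[str], str]


@dataclass
class AIPatient:
    """Holds the scenario and the growing list of expert-defined principles."""
    scenario: str
    principles: List[str] = field(default_factory=list)

    def add_feedback(self, feedback: str, llm: LLM) -> str:
        """Ask the LLM to turn free-form expert feedback into one principle."""
        prompt = (
            "Rewrite this expert feedback as one general rule for the "
            f"roleplayed patient:\n{feedback}"
        )
        principle = llm(prompt).strip()
        self.principles.append(principle)
        return principle

    def build_prompt(self, history: List[str]) -> str:
        """Assemble the roleplay prompt; later turns see every principle so far."""
        rules = "\n".join(f"- {p}" for p in self.principles) or "- (none yet)"
        dialogue = "\n".join(history)
        return (
            f"Scenario: {self.scenario}\n"
            f"Principles:\n{rules}\n"
            f"Conversation:\n{dialogue}\n"
            "Patient:"
        )
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM provider and makes it easy to try with a stub function.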

Principle-Adherence Prompting Pipeline

We introduce a Principle-Adherence prompting pipeline that mitigates failures to satisfy expert-defined principles and dialogue conventions. In Stage 1, each expert-defined principle is rewritten into several Yes/No questions, and the LLM generates additional questions that check adherence to dialogue conventions such as coherence and consistency. In Stage 2, the LLM (a) evaluates whether each question is applicable to the context and, if so, answers it for the candidate response; and (b) refines the response so that it ideally receives Yes on all questions.
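The two stages can be sketched roughly as below. This is a hedged illustration under stated assumptions: the function names, the prompt wording, the `Yes`/`No`/`N/A` answer format, and the `LLM` callable are hypothetical and are not taken from the released code.

```python
from typing import Callable, List

# Hypothetical interface: any function that takes a prompt and returns text.
LLM = Callable[[str], str]


def _lines(text: str) -> List[str]:
    """Split an LLM reply into non-empty, stripped lines."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def stage1_questions(principles: List[str], context: str, llm: LLM) -> List[str]:
    """Stage 1: principles become Yes/No questions, plus convention questions."""
    questions: List[str] = []
    for principle in principles:
        questions += _lines(llm(
            f"Rewrite this principle as Yes/No questions, one per line:\n{principle}"
        ))
    # Extra questions about dialogue conventions (coherence, consistency).
    questions += _lines(llm(
        "List Yes/No questions that check coherence and consistency "
        f"for this dialogue, one per line:\n{context}"
    ))
    return questions


def stage2_refine(context: str, response: str, questions: List[str], llm: LLM) -> str:
    """Stage 2: answer each applicable question, then revise if any got No."""
    failed: List[str] = []
    for q in questions:
        verdict = llm(
            "Judge the response. Answer Yes, No, or N/A if the question "
            f"does not apply.\nContext:\n{context}\nResponse:\n{response}\nQuestion: {q}"
        )
        words = verdict.strip().lower().split()
        # Only an explicit "no" counts as a failure; N/A questions are skipped.
        if words and words[0].strip(".,:") == "no":
            failed.append(q)
    if not failed:
        return response
    bullets = "\n".join(f"- {q}" for q in failed)
    return llm(
        "Revise the response so every question below is answered Yes.\n"
        f"Context:\n{context}\nResponse:\n{response}\nQuestions:\n{bullets}"
    ).strip()
```

In this sketch the response is returned unchanged when no question is answered No, so the extra revision call is only made when a failure is detected.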

Creator Study Results

Ratings by Counselor Creators

| Measure | Scenario Only | + Principles |
|---|---|---|
| Authenticity | 5.24 | +0.80 ** |
| Stayed in Role | 6.32 | +0.08 |
| Resembled Past Case | 4.80 | +0.76 * |
| Mirrored Challenging Aspects | 4.52 | +1.00 * |
| Ready as Training Partner | 5.16 | +0.64 * |
| Recommend to Novices | 5.76 | +0.52 * |

Ratings by Third-Party Counselors

| Measure | Scenario Only | + Principles |
|---|---|---|
| Authenticity | 5.32 | +0.31 * |
| Stayed in Role | 6.29 | +0.09 |
| Resembled Typical Case | 4.91 | +0.49 ** |
| Challenged the Counselor | 2.13 | +0.22 |
| Ready as Training Partner | 5.05 | +0.39 ** |
| Recommend to Novices | 5.03 | +0.38 * |

Creators and third-party counselors compared the Scenario-Only and Scenario+ExpertPrinciples AI patients on 7-point Likert-scale measures; third-party judges were given identical measures where possible, with two measures modified to match the external perspective. Creator Ratings: Creators (N=25) rated both AI patients. After refining the AI patient simulation with principles, creators rated the patient significantly higher on all measures except Stayed in Role, on which both AI patients scored highly. Third-Party Ratings: Third-party counselors (N=5) provided 125 total comparisons of the two AI patient versions. The treatment effect of adding expert principles was estimated using the following linear mixed-effects model: Rating~Treatment+CreatorID+(1|AnnotatorID). Third-party counselors rated AI patients with principles significantly higher on 4 of the 6 measures. (***: p < 0.001, **: p < 0.01, *: p < 0.05.)
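Written out, the model formula above can be read as follows, assuming the usual lme4-style formula convention in which `Treatment` and `CreatorID` enter as fixed effects and `(1|AnnotatorID)` is a random intercept per annotator. The indices (c for creator, a for annotator, t for AI patient version), the 0/1 coding of the treatment, and the Gaussian distributional assumptions are the standard ones for this model class and are not stated in the source.

```latex
\[
\text{Rating}_{c,a,t} = \beta_0 + \beta_1\,\text{Treatment}_t + \gamma_c + u_a + \varepsilon_{c,a,t},
\qquad
u_a \sim \mathcal{N}(0, \sigma_u^2),
\quad
\varepsilon_{c,a,t} \sim \mathcal{N}(0, \sigma^2)
\]
```

Here Treatment_t is 0 for the Scenario-Only patient and 1 for the Scenario+ExpertPrinciples patient, so under this reading β_1 is the estimated treatment effect of adding expert principles, γ_c absorbs differences between creators' scenarios, and u_a absorbs each annotator's overall tendency to rate high or low.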

| # AI patients | Theme | Example Principle |
|---|---|---|
| 7 | Use colloquial and realistic language. | Incorporate natural speech patterns, improper grammar and punctuation, including the use of slang and less structured sentences, to convey a more authentic and relatable character. |
| 14 | Show initial mistrust and hesitation with the idea of seeking help. | When expressing feelings of overwhelm and doubt, provide limited information and express skepticism towards the effectiveness of seeking help. |
| 19 | Show emotions in detail, elaborating with examples as needed. * | When describing personal struggles, provide specific details and symptoms to help the listener understand the situation better. |
| 9 | Be less self-aware of emotions, thoughts, and needs. Articulate thoughts in a more disorganized way. | When expressing reluctance or uncertainty about seeking help or accepting praise, it's important to convey the internal struggle and conflicting emotions, rather than presenting a clear-cut decision or emotion. |
| 3 | Do not seek out solutions, but rather just share thoughts and feelings. * | When expressing feelings of being stuck or defeated, focus on sharing emotions rather than seeking a resolution. |
| 12 | Proactively seek out solutions and show reflective insight over time. * | When discussing personal struggles, provide reflective insights into your situation and propose actionable steps for improvement to continue the conversation effectively. |

We conducted a thematic analysis of the expert-defined principles for the AI patients created and display several representative examples. Principles marked (*) are novel compared to those defined in prior work on AI patients (Chen et al. 2023, Stapleton 2023). The themes are categorized into the stages of conversation from Liu et al. 2021's work on Emotional Support Conversations: exploration, comforting, and action; themes relating to the overall conversation are categorized as stage-agnostic.

Evaluation of Principle-Adherence Pipeline

Principle-adherence prompting pipeline

Win/Tie/Loss results on the Error Test Cases for Consistency with Context (M1), Principle Adherence (M3), and Overall. Results come from pairwise preference evaluation against the No Critique baseline, aggregated by majority voting.
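As an illustration of how majority voting can turn per-annotator pairwise preferences into Win/Tie/Loss counts, consider the minimal sketch below. The label strings, the function names, and the rule that a split vote with no single most common label counts as a tie are assumptions made for this example; the number of annotators per test case is not given in this excerpt.

```python
from collections import Counter
from typing import Dict, List


def majority_label(votes: List[str]) -> str:
    """Return the most common vote; a split with no single winner counts as 'tie'."""
    if not votes:
        raise ValueError("each test case needs at least one vote")
    ranked = Counter(votes).most_common()
    # Assumption: if the top two labels have equal counts, call the case a tie.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def win_tie_loss(cases: List[List[str]]) -> Dict[str, int]:
    """Tally majority-voted outcomes of a method against the baseline."""
    tally = Counter(majority_label(votes) for votes in cases)
    return {label: tally.get(label, 0) for label in ("win", "tie", "loss")}


# Made-up example: three test cases, three annotator votes each.
example = [
    ["win", "win", "loss"],
    ["tie", "loss", "loss"],
    ["win", "tie", "loss"],
]
print(win_tie_loss(example))
```

Each inner list holds the votes for one test case, where "win" means the annotator preferred the method's response over the baseline's.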

BibTeX

@misc{louie2024roleplaydoh,
    title={Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients via Eliciting and Adhering to Principles},
    author={Ryan Louie and Ananjan Nandi and William Fang and Cheng Chang and Emma Brunskill and Diyi Yang},
    year={2024},
    eprint={2407.00870},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2407.00870},
}

Usage and License Notices

The code and data are intended and licensed for research use only. Please do not use them for any malicious purposes.

This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The source code of this website is borrowed from Nerfies.