Jun 11, 2023

Chat Bot Innovation

A conversational bot that can conduct phone conversations requires three components: a speech-to-text module, a text-based conversational AI bot and a text-to-speech module. In our pilot study, we found that commercial speech-to-text technology was highly accurate and reasonably priced. For text-to-speech, recent advances in the field [11] have enabled convincing speech generation that is difficult to distinguish from human speech. Conversational bots have also advanced significantly on multiple fronts. Generated text is now fluent enough to be difficult to differentiate from human-authored text [12], [13], with the main recognisable difference being thematic and factual consistency. For conversational AI, this consistency has also been substantially improved [6]. One final innovation that completes the ability of AI bots to mimic scam victims is the addition of personas [14]. These allow the bots to maintain consistent knowledge of personal facts such as a name, an address and aspects of a fictitious personal life. Having reviewed each of these advances, we believe they provide the building blocks of a sufficiently convincing mimic of a vulnerable human scam victim. Open-source pre-trained bots such as the ParlAI “BlenderBot” [7] already combine these advances and can be readily adapted to our purpose.
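The three-component design described above can be sketched as a simple pipeline. This is only an illustrative outline, not the project's implementation: the class name `PhoneBot` and the stand-in functions `echo_stt`, `stalling_reply` and `echo_tts` are hypothetical placeholders for a real speech-to-text service, a persona-conditioned conversational model and a text-to-speech (voice clone) model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PhoneBot:
    """Chains speech-to-text, a conversational model and text-to-speech."""
    speech_to_text: Callable[[bytes], str]
    reply: Callable[[List[str], str], str]  # (history, persona) -> reply text
    text_to_speech: Callable[[str], bytes]
    persona: str = ""
    history: List[str] = field(default_factory=list)

    def turn(self, caller_audio: bytes) -> bytes:
        """Process one caller utterance and return the bot's spoken answer."""
        heard = self.speech_to_text(caller_audio)
        self.history.append(heard)
        answer = self.reply(self.history, self.persona)
        self.history.append(answer)
        return self.text_to_speech(answer)


# Hypothetical stand-ins so the sketch runs without any external service.
def echo_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def stalling_reply(history: List[str], persona: str) -> str:
    return f"{persona} here. Sorry, did you say: {history[-1]}"


def echo_tts(text: str) -> bytes:
    return text.encode("utf-8")


bot = PhoneBot(echo_stt, stalling_reply, echo_tts, persona="Pat")
audio_out = bot.turn(b"This is your bank calling.")
```

Keeping the three components behind plain callables means any one of them (for example, the voice used for speech generation) could be swapped without touching the others.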

Voice cloning is a type of “deep fake” in which deep learning AI models generate, from text inputs, speech audio that sounds like a given person. The person whose voice is being cloned provides recordings of their voice, which are used to train the AI model. Once the model is sufficiently trained, arbitrary text can be provided to it, and it will “speak” the text in the person’s voice. It is further possible to vary the voice, e.g. to change the apparent age and gender of the generated voice and to modulate expressed emotion. Numerous publicly available “voice clones” exist, as do many commercially available voice clones and services (e.g. resemble.ai). In addition, we will recruit volunteers to produce voice clones, which we will modify for a level of privacy protection, and we will explore voice clones from publicly available voice recordings. We expect to obtain at least 20 base voices, each with multiple variations. The choice of voice clone and variation will be one of the variables tuned to optimize call duration for our bots.

Conversational AI: At the core of our bots we will use conversational AI models built around large pre-trained sequence-to-sequence models such as BART [13], T5 [11], [12] or GPT [15], [16]. These models achieve very good fluency and have many variations available as open-source tools (BART, T5 and GPT-2) or paid services (GPT-3). We will explore innovations built around these models, such as improved short-term memory [6], personas and empathy [14], [17]. We will further innovate with novel approaches to, and combinations of, these methods, inspired by and tailored to our use case. These will be evaluated in controlled experiments (Task 3) as well as through analysis of actual scam calls “in the wild” (Task 1), in a virtuous cycle of innovation, tuning and deployment.

Fine-tuning on conversation data such as scam call transcripts is a standard approach to domain adaptation of pre-trained conversational AI models and has been shown to be effective [18]. Our case presents novel challenges for fine-tuning due to long call durations (calls in our pilot data averaged 86 utterances) and the adversarial nature of our task (we are not seeking high-quality, effective conversation, but to prolong the conversation irrespective of its quality).

Once “wild” data from calls with real scammers is available, a second form of training becomes possible. Our primary goal is for our bots to achieve long call durations with real scammers. We can therefore use the duration of a “wild” call as a reinforcement learning (RL) training objective [19], [20], with a small positive reward for each utterance and a large negative reward when the scammer hangs up. Annotations from Tasks 1 and 3 that relate directly to longer call durations may also be used as RL training objectives.
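A minimal sketch of the reward scheme described above follows. The function names, the reward magnitudes (`step_reward`, `hangup_penalty`) and the discount factor `gamma` are illustrative assumptions; the text only specifies a small positive reward per utterance and a large negative reward on hang-up.

```python
from typing import List


def turn_rewards(num_utterances: int,
                 scammer_hung_up: bool,
                 step_reward: float = 0.1,
                 hangup_penalty: float = -5.0) -> List[float]:
    """Per-utterance rewards for one call (assumed values, for illustration)."""
    rewards = [step_reward] * num_utterances
    if scammer_hung_up and rewards:
        # The penalty is attached to the last utterance before the hang-up.
        rewards[-1] += hangup_penalty
    return rewards


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return-to-go for each utterance: G_t = r_t + gamma * G_{t+1}."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```

Under this sketch, utterances close to a hang-up receive a strongly negative return, while earlier utterances in a long call accumulate positive return, which is the signal a policy-gradient method would use to favour responses that keep the scammer on the line.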

Our analysis of scam call data, together with background knowledge of scammer methodologies and the psychology of persuasion, provides further training targets. In particular, these include features associated with scammer script steps and features expected or found to be associated with ending or extending a call, such as scammers’ negative emotion and threats. These would be incorporated into training as side tasks in addition to the main fine-tuning task. The intuition is that a model able to distil the knowledge necessary to predict call features associated with longer “wild” call durations will be equipped to recognise model updates that are effective for achieving longer calls. It is well established in the literature that transfer learning through training on multiple related tasks is beneficial. Several architectures can be explored for implementing side tasks, such as predicting from the last hidden layers of the underlying transformer, predicting from the RL action space as used in [19], or using the K-adapter framework [21].
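One common way to combine a main fine-tuning task with side tasks is a weighted sum of losses. The formulation below is an illustrative assumption rather than the project's stated objective: $\mathcal{L}_{\text{LM}}$ denotes the main language-modelling (fine-tuning) loss, $\mathcal{L}_k$ the loss of the $k$-th side task (for example, predicting a scammer script step or a threat), and $\lambda_k \ge 0$ a hypothetical weight controlling how strongly that side task influences the shared parameters $\theta$.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{LM}}(\theta) \;+\; \sum_{k=1}^{K} \lambda_k \, \mathcal{L}_k(\theta)
```

Setting every $\lambda_k = 0$ recovers plain fine-tuning, so the weights give a direct way to test whether a given side task helps.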


[6] J. Xu, A. Szlam, and J. Weston, ‘Beyond Goldfish Memory: Long-Term Open-Domain Conversation’, arXiv:2107.07567 [cs], Jul. 2021. Accessed: Jul. 20, 2021. [Online]. Available: http://arxiv.org/abs/2107.07567

[7] ‘Blender Bot 2.0: An open source chatbot that builds long-term memory and searches the internet’. https://ai.facebook.com/blog/blender-bot-2-an-open-source-chatbot-that-builds-long-term-memory-and-searches-the-internet/ (accessed Nov. 25, 2021).

[11] J. Ao et al., ‘SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing’, arXiv:2110.07205 [cs, eess], Oct. 2021. Accessed: Nov. 15, 2021. [Online]. Available: http://arxiv.org/abs/2110.07205

[12] C. Raffel et al., ‘Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer’, J. Mach. Learn. Res., vol. 21, no. 140, pp. 1–67, 2020.

[13] M. Lewis et al., ‘BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension’, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, Jul. 2020, pp. 7871–7880. doi: 10.18653/v1/2020.acl-main.703.

[14] H. Song, Y. Wang, K. Zhang, W.-N. Zhang, and T. Liu, ‘BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data’, arXiv:2106.06169 [cs], Jun. 2021. Accessed: Jul. 29, 2021. [Online]. Available: http://arxiv.org/abs/2106.06169

[15] T. Brown et al., ‘Language Models are Few-Shot Learners’, in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 1877–1901. [Online]. Available: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

[16] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, ‘Language Models are Unsupervised Multitask Learners’, p. 24, 2019.

[17] K. Shuster, D. Ju, S. Roller, E. Dinan, Y.-L. Boureau, and J. Weston, ‘The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents’, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, Jul. 2020, pp. 2453–2470. doi: 10.18653/v1/2020.acl-main.222.

[18] S. Roller et al., ‘Recipes for Building an Open-Domain Chatbot’, in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, Apr. 2021, pp. 300–325. Accessed: Jul. 18, 2021. [Online]. Available: https://aclanthology.org/2021.eacl-main.24

[19] T. Zhao, K. Xie, and M. Eskenazi, ‘Rethinking Action Spaces for Reinforcement Learning in End-to-end Dialog Agents with Latent Variable Models’, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, Jun. 2019, pp. 1208–1218. doi: 10.18653/v1/N19-1123.

[20] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky, ‘Deep Reinforcement Learning for Dialogue Generation’, in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, Nov. 2016, pp. 1192–1202. doi: 10.18653/v1/D16-1127.

[21] R. Wang et al., ‘K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters’, in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online: Association for Computational Linguistics, Aug. 2021, pp. 1405–1418. doi: 10.18653/v1/2021.findings-acl.121.