Perceptions of Language Technology Failures from South Asian English Speakers

1Georgia Institute of Technology, 2Stanford University
We survey 78 South Asian English (SAsE) speakers on their experiences and preferences with language technology and codify user responses into core challenges. We then construct intrinsic benchmarks of SAsE knowledge to address the pain-points identified through surveying (SAsE Lexical and Indian English Syntactic understanding). We evaluate 11 families of LLMs on these benchmarks. Researchers can use this work to guide the design of inclusive NLP systems to address user wants and benchmark their own systems.

Abstract

English NLP systems have empirically worse performance for dialects other than Standard American English (SAmE). However, how these discrepancies impact the use of language technology by speakers of non-SAmE global Englishes is not well understood. We focus on reducing this gap for South Asian Englishes (SAsE), a macro-group of regional varieties with cumulatively more speakers than SAmE, by surveying SAsE speakers about their interactions with language technology and comparing their responses to a control survey of SAmE speakers. SAsE speakers are more likely to recall failures with language technology and more likely to reference specific issues with written language technology than their SAmE counterparts. Furthermore, SAsE speakers indicate that they modify both their lexicon and syntax to make technology work better, but that lexical issues are perceived as the most salient challenge. We then assess whether these issues are pervasive in more recently developed Large Language Models (LLMs), introducing two benchmarks for broader SAsE Lexical and Indian English Syntactic understanding and evaluating 11 families of LLMs on them.

Background

NLP systems are primarily designed for speakers of Standard American English (SAmE), despite English being a diverse global language with many varieties beyond SAmE. As such, prior research has found empirically worse performance in NLP systems for non-SAmE dialects. However, the degree to which these discrepancies affect user experience is not well understood, which leaves open the question of whether reducing these gaps would have a noticeable and desirable impact on the speakers of these dialects.

Figure: Global Varieties of English from eWAVE

In the spirit of Blodgett et al., 2016, future work in dialect disparity should center on the lived experiences of those affected by language technology’s failures. Therefore, the next step in developing inclusive language technology is understanding non-SAmE speakers' lived experiences with NLP systems and how these experiences manifest in language technology.

Proposed Solution

To understand and evaluate user pain-points, we follow a mixed-methods approach: a user-centric diagnostic study of language technology failures, benchmark creation corresponding to the reported failures, and LLM evaluation on those benchmarks. We focus specifically on speakers of South Asian Englishes (SAsE), a macro-group of regional English varieties with collectively more speakers than SAmE. Our user-centric study surveys 78 SAsE speakers. We find that SAsE speakers are significantly more likely to recall instances of language technology failures overall, and that failures with written technology (as opposed to speech-based technology) are more unique to SAsE speakers. Further, SAsE speakers consistently mention similar challenges, allowing us to codify language failures into core categories to guide future research.

"I think technologies should be designed in a way that they are able to understand ever[y] dialect." - Participant 1

We then develop a lexical benchmark and a syntactic benchmark to assess the relevance of these challenges to state-of-the-art text-based systems, which may also be used to evaluate future systems for inclusivity of SAsE. These benchmarks assess understanding of 317 loanwords, 724 stand-alone dialect terms, and 110 syntactic features. Finally, we evaluate 8 families of open-source LLMs and 3 providers of closed-source LLMs.

User-Centric Diagnostic Study

We extend prior work on user-centric surveying research to develop a survey to elicit how empirical NLP failures with dialect data impact user perceptions and interactions. Our survey aims to:

  1. Quantitatively assess the differences in language technology failures between SAsE and SAmE speakers.
  2. Gather qualitative feedback on user experiences and adaptations to better understand whether failure modes correspond to dialect usage.

We survey 78 SAsE speakers and 97 SAmE speakers (as a control) on Prolific.

Results

We find that a majority of both SAsE and SAmE participants recall instances when technology does not understand them well, but SAsE speakers do so 14% more often than SAmE speakers. The survey then probes participants for where they perceive technology failures. Here we find that SAsE speakers are significantly more likely than their SAmE counterparts to list at least one written technology, such as ChatGPT, search engines, and Grammarly, and significantly less likely to list at least one spoken technology, such as Siri, Alexa, and automated phone services. This finding shows that the performance failures found in prior work on text-based NLP are creating notably different user experiences for speakers of non-SAmE dialects.

Figure: Survey comparison chart
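To make the group comparison above concrete, below is a minimal sketch of how such a difference in recall rates could be tested. The per-group recall counts and the choice of a two-proportion z-test are illustrative assumptions, not the exact analysis used in this work; only the group sizes (78 and 97) come from the survey described above.

    # Minimal sketch: testing whether one group recalls technology failures
    # more often than another. Recall counts are hypothetical placeholders.
    from statsmodels.stats.proportion import proportions_ztest

    recall_counts = [62, 60]   # hypothetical: respondents recalling a failure (SAsE, SAmE)
    group_sizes = [78, 97]     # respondents surveyed per group

    z_stat, p_value = proportions_ztest(count=recall_counts, nobs=group_sizes)
    rates = [c / n for c, n in zip(recall_counts, group_sizes)]
    print(f"SAsE: {rates[0]:.1%}, SAmE: {rates[1]:.1%}, z = {z_stat:.2f}, p = {p_value:.3f}")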

Using keyword analysis, we find three common challenges emerge across SAsE participants. Notably, the challenge most frequently identified by participants (failures with stand-alone dialect words) diverges from the challenges emphasized in existing research.

Reported challenges, example quotes, and percentage of occurrences.

  Challenge | Example Quote | Occurrence
  1. Failures with stand-alone dialect words | "[I avoid using] some slang words. 'Buggy' instead of 'shopping cart' for example." - Participant 2 | 43%
  2. Failures when switching between languages | "I want to be able to speak bilingually with technology." - Participant 7 | 18%
  3. Failures with colloquial dialect features | "Language in for technology is so much more formal than spoken." - Participant 19 | 20%
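The keyword analysis used to codify responses can be approximated with a simple keyword pass over the free-text answers. The sketch below illustrates the idea; the keyword lists and example responses are hypothetical placeholders, not the codebook or data from this study.

    # Sketch of keyword-based coding of free-text survey responses into
    # challenge categories. Keyword lists and responses are illustrative
    # placeholders, not the study's actual codebook or data.
    challenge_keywords = {
        "stand-alone dialect words": ["slang", "word", "vocabulary", "spelling"],
        "switching between languages": ["bilingual", "translate", "language", "mix"],
        "colloquial dialect features": ["formal", "casual", "grammar", "phrase"],
    }

    def code_responses(responses):
        """Return the share of responses mentioning any keyword for each challenge."""
        counts = dict.fromkeys(challenge_keywords, 0)
        for text in responses:
            lowered = text.lower()
            for challenge, keywords in challenge_keywords.items():
                if any(keyword in lowered for keyword in keywords):
                    counts[challenge] += 1
        return {challenge: count / len(responses) for challenge, count in counts.items()}

    responses = [
        "I avoid using some slang words when typing to assistants.",
        "I want to be able to speak bilingually with technology.",
    ]
    print(code_responses(responses))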

Benchmarks of SAsE Knowledge

Lexical

Existing benchmarks do not cover all of the reported challenge categories and notably omit stand-alone lexical variation! To address this gap, we create an intrinsic assessment of lexical understanding by pulling 724 stand-alone dialect terms from Wiktionary to address Challenge #1. We pull an additional 317 loanwords from other South Asian Englishes to address Challenge #2. We format these terms as multiple-choice questions in which the correct definition is placed alongside three incorrect definitions. The correct definition is the one provided by Wiktionary, while the incorrect definitions are randomly sampled from the definitions of other terms.

Figure: Lexical benchmark construction
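A minimal sketch of this multiple-choice construction is shown below, assuming each scraped entry is already a (term, definition) pair; the entries are hypothetical placeholders rather than items from the benchmark, but the format follows the one-correct-plus-three-distractors setup described above.

    # Sketch: turning (term, definition) pairs into 4-way multiple-choice
    # questions. The entries are hypothetical placeholders, not benchmark data.
    import random

    entries = [
        {"term": "prepone", "definition": "to move an event to an earlier time"},
        {"term": "batchmate", "definition": "a classmate from the same year or cohort"},
        {"term": "tiffin", "definition": "a light midday meal or snack"},
        {"term": "buggy", "definition": "a shopping cart"},
    ]

    def make_mcq(entry, all_entries, num_distractors=3, rng=random):
        """Pair the correct definition with definitions sampled from other terms."""
        other_definitions = [e["definition"] for e in all_entries if e["term"] != entry["term"]]
        options = rng.sample(other_definitions, num_distractors) + [entry["definition"]]
        rng.shuffle(options)
        return {
            "question": f"Which of the following is the meaning of '{entry['term']}'?",
            "options": options,
            "answer": options.index(entry["definition"]),
        }

    questions = [make_mcq(entry, entries) for entry in entries]
    print(questions[0])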

Syntactic

To address Challenge #3, we create a minimal-pair syntactic evaluation of 110 sentences aligned between SAsE and SAmE, augmented with aligned negative examples whose syntax is not attested in SAsE, generated via rule-based transformations. The expectation is that the model should assign higher probability to the sentence demonstrating syntax attested in Indian English than to the sentence that does not demonstrate any acceptable SAsE syntax.

Figure: Syntactic benchmark construction
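The scoring rule can be sketched as a log-probability comparison between the two sentences in a pair. The example below uses GPT-2 from the Hugging Face transformers library purely as a stand-in scorer, and the sentence pair is an illustrative attested Indian English construction with a rule-transformed negative, not an item from the benchmark.

    # Sketch: minimal-pair scoring. The model should assign a higher summed
    # log-probability to the sentence with attested Indian English syntax than
    # to its unattested counterpart. GPT-2 and the sentence pair are stand-ins.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def sentence_log_prob(sentence):
        """Summed token log-probability of a sentence under the language model."""
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            loss = model(**inputs, labels=inputs["input_ids"]).loss
        # loss is the mean negative log-likelihood over the predicted positions,
        # so multiply by the number of predicted tokens to recover the sum.
        num_predicted = inputs["input_ids"].shape[1] - 1
        return -loss.item() * num_predicted

    attested = "She is knowing the answer."   # attested Indian English syntax (illustrative)
    negative = "She is the knowing answer."   # unattested rule-based transformation (illustrative)

    print("prefers attested sentence:", sentence_log_prob(attested) > sentence_log_prob(negative))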

Evaluation of LLMs

Challenges #1 and #2

Across open-access models, 14 out of 15 models that achieve greater than 60% accuracy on the control set perform significantly worse on SAsE lexical knowledge overall. In general, models perform better on Challenge #1, with the exception of the first LLaMA models, which perform better on loanwords (Challenge #2) at all scales. Furthermore, while 4 out of 6 industrial LLMs also perform significantly worse for SAsE, GPT-4 and GPT-4-Turbo both achieve over 90% accuracy on this benchmark. The prevalence of significantly lower performance across the evaluations of Challenges #1 and #2 provides quantitative support for surveyed user perceptions, even in recently developed systems.

Figure: Results for Challenges #1 and #2

Challenge #3

Challenge #3 results are far more consistent across both model families and scales. Every model evaluated achieves near-perfect results on the SAmE variant of the benchmark. Despite this, all models perform significantly worse on our SAsE benchmark, with the best performance being 89% accuracy, achieved by LLaMA 65B. The consistency of this trend across both model size and training data volume indicates that scaling is unlikely to provide intrinsic understanding of valid SAsE syntax.

Figure: Results for Challenge #3

Key Findings!

  1. While the majority of both groups recall issues with language technology, US-based SAsE speakers do so 14% more often than SAmE speakers.
  2. Differences in user experience go beyond accent. While spoken language technology more frequently causes issues for both groups, more SAsE speakers report issues with written language technology than their SAmE counterparts.
  3. Users cite failures with stand-alone dialect words as the most salient pain-point and, in free-form responses, report challenges with both words and syntax attested in SAsE; users tend to remove such features to try to make technology work better.
  4. Benchmark results support user perceptions, showing a performance dip in user-identified challenge categories in recent LLMs. These results indicate that empirical differences in SAsE NLP performance create different perceptions of written language technologies for SAsE speakers.

Overall, we highlight the need to center user perspectives in the design and improvement of dialect-inclusive NLP and hope this work can aid in the development of more equitable language technology.

Limitations and Ethical Considerations

As a human subjects survey, this project was reviewed and approved by the lead authors’ Institutional Review Board.

Our study was constrained by a relatively small sample size due to the availability of participants, and we did not recruit participants outside of the US. With regard to the study of SAsE specifically, both individual varieties and speakers are influenced by many different regional, economic, and linguistic backgrounds; further research may reveal differences in user preferences between varieties of SAsE and within each variety itself. We also note that none of the authors of this study speak a variety of SAsE. This language limitation may have influenced our ability to fully understand and capture the perspectives of SAsE-speaking participants. Lastly, we want to draw attention to less visible NLP systems that recommend content, target advertisements, and moderate platforms, and that are generally applied to users without their knowledge. As such, surveying about user perceptions can easily underestimate the true extent of the societal effects of pervasive NLP systems.

Acknowledgement

We would like to thank the reviewers and SALT lab members for their feedback, critique, and suggestions! We also thank Devyani Sharma for providing a valuable reading list at the start of this research. Computing resources for this project were in part provided through a Stanford Institute for Human-Centered Artificial Intelligence Google Cloud Credit Grant.

BibTeX


      @inproceedings{holt2024perceptions,
         title={Perceptions of Language Technology Failures
          from South Asian English Speakers},
         author={Faye Holt and William Held and Diyi Yang},
         booktitle={Findings of the Association for Computational Linguistics: ACL 2024},
         month={aug},
         year={2024}
      }