value-nlp: A Toolkit for Cross-Dialectal NLP

What is value-nlp?

value-nlp is a suite of resources for building fair and equitable NLP systems that are dialect invariant: systems that perform consistently across dialect shifts and therefore avoid allocative harms.

This package contains tools for systematically perturbing text with attested linguistic patterns from 50 varieties of English.


How can I use value-nlp?

Researchers can use value-nlp to accomplish the following:

📐   Benchmarking: NLP researchers can more comprehensively evaluate task performance across dialect shifts.

⚖️   Bias and Fairness: Fairness, accountability and transparency researchers can more directly examine the ways NLP systems systematically harm disadvantaged or protected groups.

🌏   Linguistic Typology: Computational linguists can systematically probe the internal representations of large language models, grounded in the literature on theoretical and field linguistics.

🌱   Low-resource NLP: Practitioners can adapt models to dialects which have limited labeled data by building on rich knowledge from dialectology.

Below is the unfolding value-nlp research story.

Perceptions of Language Technology Failures from South Asian English Speakers

This user-centric NLP study shows qualitatively that language technologies are not dialect invariant. They often fail for speakers of non-standard English varieties, misunderstanding both individual words and the syntax of phrases. Compared with speakers of Standard American English (SAE), speakers of non-standard varieties report more failures with written technologies.

[Figure: survey comparison chart]

Multi-VALUE: A Framework for Cross-Dialectal English NLP

These value-nlp stress-test experiments show quantitatively that many text-based NLP systems are not dialect invariant, with notable performance drops for widely spoken varieties like Indian English and Colloquial Singapore English. This work also shows how to address these disparities with data augmentation.

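For intuition, here is a minimal data-augmentation sketch built on the transform API shown in the demo below. The augment helper and the choice of dialect are illustrative assumptions, not part of the package.

from multivalue import Dialects

def augment(sentences, dialect=None):
    """Pair each training sentence with a pseudo-dialectal copy."""
    # Any dialect object exposing transform() works; Southeast American enclave
    # dialects are used here only because they appear in the demo below.
    dialect = dialect or Dialects.SoutheastAmericanEnclaveDialect()
    return sentences + [dialect.transform(s) for s in sentences]

train = ["I talked with them yesterday", "She has finished the report"]
augmented_train = augment(train)  # original sentences plus dialect-transformed copies

Training on the augmented set exposes a model to both standard and dialectal surface forms, which is the kind of augmentation the paragraph above refers to.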

DADA: Dialect Adaptation via Dynamic Aggregation of Linguistic Rules

DADA is an efficient fine-tuning method for achieving dialect invariance. It works by composing modular adapters, each of which handles a specific linguistic feature.

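For intuition only, here is a minimal, framework-level sketch of that idea rather than the authors' implementation: one small bottleneck adapter per linguistic feature, with per-token weights deciding how much each adapter contributes. Module names, dimensions, and the softmax weighting are illustrative assumptions.

import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    """Bottleneck adapter for a single linguistic feature (e.g., completive 'done')."""
    def __init__(self, hidden_dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)

    def forward(self, h):
        return self.up(torch.relu(self.down(h)))

class DynamicAggregator(nn.Module):
    """Mixes per-feature adapter outputs with per-token weights, added residually."""
    def __init__(self, num_features, hidden_dim=768):
        super().__init__()
        self.adapters = nn.ModuleList(FeatureAdapter(hidden_dim) for _ in range(num_features))
        self.scorer = nn.Linear(hidden_dim, num_features)  # one weight per adapter

    def forward(self, h):  # h: (batch, seq, hidden)
        weights = torch.softmax(self.scorer(h), dim=-1)               # (batch, seq, features)
        outputs = torch.stack([a(h) for a in self.adapters], dim=-1)  # (batch, seq, hidden, features)
        return h + (outputs * weights.unsqueeze(-2)).sum(dim=-1)

hidden_states = torch.randn(2, 10, 768)  # stand-in for transformer hidden states
print(DynamicAggregator(num_features=4)(hidden_states).shape)  # torch.Size([2, 10, 768])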

Task-Agnostic Low-Rank Adapters for Unseen English Dialects

HyperLoRA is a parameter-efficient method for building dialect-invariant NLP systems in a task-agnostic manner. It relies on hypernetworks to disentangle dialect-specific and cross-dialectal information.

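As a rough illustration of the hypernetwork idea (not the paper's full design, which additionally separates dialect-specific from cross-dialectal information), the sketch below maps a hypothetical vector of attested dialect features to LoRA-style low-rank updates for a frozen linear layer.

import torch
import torch.nn as nn

class LoRAHyperNetwork(nn.Module):
    """Generates rank-r update matrices (A, B) for a d_in -> d_out layer from dialect features."""
    def __init__(self, feat_dim, d_in, d_out, rank=8):
        super().__init__()
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        self.to_A = nn.Linear(feat_dim, d_in * rank)
        self.to_B = nn.Linear(feat_dim, rank * d_out)

    def forward(self, dialect_feats):  # dialect_feats: (feat_dim,)
        A = self.to_A(dialect_feats).view(self.d_in, self.rank)
        B = self.to_B(dialect_feats).view(self.rank, self.d_out)
        return A, B

def adapted_forward(x, frozen_weight, A, B):
    """Frozen layer output plus the dialect-specific low-rank update x @ A @ B."""
    return x @ frozen_weight.T + (x @ A) @ B

dialect_feats = torch.tensor([1.0, 0.0, 1.0, 1.0])    # hypothetical binary feature vector
hyper = LoRAHyperNetwork(feat_dim=4, d_in=768, d_out=768)
A, B = hyper(dialect_feats)
x = torch.randn(2, 768)
frozen_weight = torch.randn(768, 768)                 # stands in for a pretrained weight matrix
print(adapted_forward(x, frozen_weight, A, B).shape)  # torch.Size([2, 768])

In this sketch, only the small hypernetwork would be trained, so low-rank updates for an unseen dialect can be generated from its feature vector alone.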

How can I set up value-nlp?

Follow the demo below to get started with Multi-VALUE.

Installation

pip install value-nlp

Demo

>>> from multivalue import Dialects
>>> southern_usa = Dialects.SoutheastAmericanEnclaveDialect()
>>> print(southern_usa)
[VECTOR DIALECT] Southeast American enclave dialects (abbr: SEAmE)
      Region: United States of America
      Latitude: 34.2
      Longitude: -80.9
>>> southern_usa.transform("I talked with them yesterday")
'I done talked with them yesterday'
>>> southern_usa.executed_rules
{(2, 7): {'value': 'done talked', 'type': 'completive_done'}}
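Building on this API, the hypothetical helper below supports the benchmarking use case described above: it discovers the dialect classes the package exposes via introspection (so no class names are hard-coded) and produces a transformed copy of an input under each one, skipping anything that does not follow the transform() interface.

import inspect
from multivalue import Dialects

def all_dialect_versions(sentence):
    """Map each available dialect class name to its transformation of the sentence."""
    versions = {}
    for name, cls in inspect.getmembers(Dialects, inspect.isclass):
        try:
            versions[name] = cls().transform(sentence)
        except Exception:
            continue  # skip classes that are not zero-argument dialect transforms
    return versions

print(all_dialect_versions("I talked with them yesterday"))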


Acknowledgements

This is a collaborative effort across Stanford University, Georgia Tech, Harvard University, and Amazon AI.