GUS Dataset (2024)

Generalizations, Unfairness, and Stereotypes Dataset (Ethical Spectacle Research)

NER Dataset: 3.7k rows | 2024

Biased Corpus: 37.5k rows | 2024

The GUS dataset, released with the GUS-Net paper, is entirely synthetic. The corpus was generated by Mistral 7B, and a random sample was labeled by GPT-4o (through a DSPy annotation pipeline) for multi-label token classification of three entity types: Generalizations, Unfairness, and Stereotypes.

The underlying corpus has 37.5k rows and contains multi-label type-of-bias (aspect-of-bias) labels for each biased text sequence.

🤗 Hugging Face Datasets

ethical-spectacle/gus-dataset-v1 · Datasets at Hugging Face

ethical-spectacle/biased-corpus · Datasets at Hugging Face
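Both repositories can be loaded with the Hugging Face `datasets` library. The following is a minimal sketch, assuming the `datasets` package is installed and the Hub is reachable. The helper name `load_gus` and the dictionary keys are illustrative, and the split names are read from the returned object rather than assumed.

```python
# Repository IDs as listed above; the dictionary keys are illustrative.
GUS_REPOS = {
    "ner": "ethical-spectacle/gus-dataset-v1",
    "corpus": "ethical-spectacle/biased-corpus",
}


def load_gus(which="ner"):
    """Download one of the two GUS repositories from the Hugging Face Hub."""
    # Imported lazily so this module can be read without `datasets` installed.
    from datasets import load_dataset

    # With no split argument, load_dataset returns every available split.
    return load_dataset(GUS_REPOS[which])


# Usage (requires network access):
# ds = load_gus("ner")
# print(ds)  # shows the available splits and columns
```

Printing the returned object is a quick way to confirm the split and column names before writing any preprocessing code.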

📑 NER Dataset Contents

| Field | Description |
| --- | --- |
| text_str | The full text fragment where bias is detected. |
| ner_tags | Token-level entity labels (Generalizations, Unfairness, Stereotypes); a token can carry more than one label. |
| rationale | The annotator's explanation for the assigned labels. |

📑 Biased Corpus Contents

| Field | Description |
| --- | --- |
| biased_text | The full text fragment where bias is detected. |
| racial | Binary label, presence (1) or absence (0) of racial bias. |
| religious | Binary label, presence (1) or absence (0) of religious bias. |
| gender | Binary label, presence (1) or absence (0) of gender bias. |
| age | Binary label, presence (1) or absence (0) of age bias. |
| nationality | Binary label, presence (1) or absence (0) of nationality bias. |
| sexuality | Binary label, presence (1) or absence (0) of sexuality bias. |
| socioeconomic | Binary label, presence (1) or absence (0) of socioeconomic bias. |
| educational | Binary label, presence (1) or absence (0) of educational bias. |
| disability | Binary label, presence (1) or absence (0) of disability bias. |
| political | Binary label, presence (1) or absence (0) of political bias. |
| sentiment | The sentiment given to Mistral 7B in the prompt. |
| target_group | The group Mistral 7B was prompted to target. |
| statement_type | Type of bias prompted (e.g. "stereotypes," "discriminatory language," "false assumptions," "offensive language," "unfair generalizations"). |

Mistral 7B was prompted to generate biased sentences using the arguments recorded in the table above (sentiment, target_group, and statement_type), so every sentence is intended to be biased. If you plan to use the dataset on text that may be unbiased, consider supplementing it with fair statements (using the same label columns).
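One way to add fair statements is to give them the same columns as the biased corpus, with every bias label set to 0. The sketch below assumes rows are handled as plain dictionaries. The helper names and example sentences are hypothetical, and leaving the three generation arguments empty for fair rows is an assumption of this sketch, not something the dataset prescribes.

```python
# Bias-type columns of the biased corpus, as listed in the table above.
BIAS_COLUMNS = [
    "racial", "religious", "gender", "age", "nationality",
    "sexuality", "socioeconomic", "educational", "disability", "political",
]


def make_fair_row(text):
    """Build a row for an unbiased sentence: every bias label is 0."""
    # The text column keeps the corpus name `biased_text` so schemas match.
    row = {"biased_text": text}
    row.update({col: 0 for col in BIAS_COLUMNS})
    # The generation arguments do not apply to text that was not produced
    # from a bias prompt, so they are left empty (an assumption).
    row.update({"sentiment": None, "target_group": None, "statement_type": None})
    return row


def active_bias_types(row):
    """Return the names of the bias columns set to 1 for a row."""
    return [col for col in BIAS_COLUMNS if row.get(col) == 1]


# Hypothetical rows written by hand for illustration, not taken from the data.
biased_example = make_fair_row("placeholder for a biased sentence")
biased_example.update({"gender": 1, "age": 1})
fair_example = make_fair_row("The library opens at nine on weekdays.")

combined = [biased_example, fair_example]
for r in combined:
    print(active_bias_types(r))
```

A classifier trained on the combined rows sees examples where all ten labels are 0, which the original corpus does not contain.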

📄 Research Paper

Responsible AI in NLP: GUS-Net Span-Level Bias Detection Dataset... (arXiv.org)

📊 Dataset Details (from the paper)
