
GUS Dataset (2024)

Generalizations, Unfairness, and Stereotypes Dataset (Ethical Spectacle Research)



NER Dataset: 3.7k rows | 2024

Biased Corpus: 37.5k rows | 2024

The GUS dataset (released in the GUS-Net paper) is an entirely synthetic dataset. The synthetic corpus was generated by Mistral 7B, and a random sample was labeled by GPT-4o (with a DSPy annotation pipeline) for multi-label token classification of the entities Generalizations, Unfairness, and Stereotypes.

The underlying corpus has 37.5k rows and contains multi-label type-of-bias (or aspect-of-bias) labels for each biased text sequence.

🤗 Hugging Face Datasets

📑 NER Dataset Contents

| Field | Description |
| --- | --- |
| text_str | The full text fragment where bias is detected. |
| ner_tags | The token-level entity labels for the text (multi-label: Generalizations, Unfairness, Stereotypes). |
| rationale | The rationale recorded for the annotation. |
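
Because each token can carry more than one entity label, a common way to prepare such data for training is to turn each token's tags into a multi-hot vector. The sketch below illustrates that idea only. The BIO-style tag names, the per-token list layout of `ner_tags`, and the example row are all assumptions for illustration, not the confirmed storage format of the dataset.

```python
# Hypothetical label set: BIO-style tags for the three GUS entities.
# The tag names and the layout of `ner_tags` are assumptions, not the
# confirmed format of the published dataset.
LABELS = ["O", "B-GEN", "I-GEN", "B-UNFAIR", "I-UNFAIR", "B-STEREO", "I-STEREO"]
LABEL_TO_ID = {label: i for i, label in enumerate(LABELS)}


def to_multi_hot(token_tags):
    """Convert a list of per-token tag lists into one multi-hot vector per token."""
    vectors = []
    for tags in token_tags:
        vec = [0] * len(LABELS)
        for tag in tags:
            vec[LABEL_TO_ID[tag]] = 1
        vectors.append(vec)
    return vectors


# Made-up example row: one list of tags per whitespace-separated token.
example = {
    "text_str": "All teenagers are lazy",
    "ner_tags": [
        ["B-GEN", "B-STEREO"],
        ["I-GEN", "I-STEREO"],
        ["I-STEREO"],
        ["I-STEREO"],
    ],
}

multi_hot = to_multi_hot(example["ner_tags"])
for token, vec in zip(example["text_str"].split(), multi_hot):
    print(token, vec)
```

A multi-hot target like this pairs naturally with a per-label sigmoid and binary cross-entropy loss, rather than a single softmax over tags.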

📑 Biased Corpus Contents

| Field | Description |
| --- | --- |
| biased_text | The full text fragment where bias is detected. |
| racial | Binary label, presence (1) or absence (0) of racial bias. |
| religious | Binary label, presence (1) or absence (0) of religious bias. |
| gender | Binary label, presence (1) or absence (0) of gender bias. |
| age | Binary label, presence (1) or absence (0) of age bias. |
| nationality | Binary label, presence (1) or absence (0) of nationality bias. |
| sexuality | Binary label, presence (1) or absence (0) of sexuality bias. |
| socioeconomic | Binary label, presence (1) or absence (0) of socioeconomic bias. |
| educational | Binary label, presence (1) or absence (0) of educational bias. |
| disability | Binary label, presence (1) or absence (0) of disability bias. |
| political | Binary label, presence (1) or absence (0) of political bias. |
| sentiment | The sentiment given to Mistral 7B in the prompt. |
| target_group | The group Mistral 7B was prompted to target. |
| statement_type | Type of bias prompted (e.g. "stereotypes," "discriminatory language," "false assumptions," "offensive language," "unfair generalizations"). |
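
The ten binary columns together form a multi-label target for each row. As a minimal sketch, assuming a row is available as a plain dictionary keyed by the field names above (the example row and its values are made up), the label vector and the list of active bias types can be read off like this:

```python
# Column names taken from the field table; the order here is a choice
# made for this sketch, not something the dataset prescribes.
BIAS_COLUMNS = [
    "racial", "religious", "gender", "age", "nationality",
    "sexuality", "socioeconomic", "educational", "disability", "political",
]


def bias_vector(row):
    """Return the 0/1 labels of a row as a fixed-order list."""
    return [int(row[col]) for col in BIAS_COLUMNS]


def active_biases(row):
    """Return the names of the bias types flagged in a row."""
    return [col for col in BIAS_COLUMNS if int(row[col]) == 1]


# Made-up row with placeholder text and illustrative values.
row = {
    "biased_text": "<biased sentence>",
    "racial": 0, "religious": 0, "gender": 1, "age": 1, "nationality": 0,
    "sexuality": 0, "socioeconomic": 0, "educational": 0, "disability": 0,
    "political": 0,
    "sentiment": "negative",
    "target_group": "<group>",
    "statement_type": "stereotypes",
}

vector = bias_vector(row)
flagged = active_biases(row)
print(vector, flagged)
```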

Mistral 7B was prompted to generate biased sentences using the arguments in the table above (sentiment, target_group, and statement_type). This means all sentences are intended to be biased. If you use the dataset with unbiased text fragments, you may want to supplement it with fair statements (with the same labels).
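
One way to supplement the corpus is sketched below. It assumes that a fair statement reuses the same columns with every bias label set to 0 and the prompt-related fields left empty; that reading of "the same labels" is an assumption, and `make_fair_row` and `mix_rows` are hypothetical helpers, not part of any published tooling.

```python
import random

BIAS_COLUMNS = [
    "racial", "religious", "gender", "age", "nationality",
    "sexuality", "socioeconomic", "educational", "disability", "political",
]
PROMPT_COLUMNS = ["sentiment", "target_group", "statement_type"]


def make_fair_row(text):
    """Build a row in the corpus schema with all bias labels set to 0 (assumed convention)."""
    row = {"biased_text": text}
    row.update({col: 0 for col in BIAS_COLUMNS})
    row.update({col: None for col in PROMPT_COLUMNS})
    return row


def mix_rows(biased_rows, fair_texts, seed=0):
    """Combine biased rows with fair rows and shuffle them reproducibly."""
    combined = list(biased_rows) + [make_fair_row(t) for t in fair_texts]
    random.Random(seed).shuffle(combined)
    return combined


# Made-up inputs: one placeholder biased row and two neutral sentences.
biased_rows = [{
    "biased_text": "<biased sentence>",
    **{col: 0 for col in BIAS_COLUMNS},
    "age": 1,
    "sentiment": "negative",
    "target_group": "<group>",
    "statement_type": "stereotypes",
}]
fair_texts = ["The library opens at nine.", "The team shipped the release on Friday."]

combined = mix_rows(biased_rows, fair_texts)
print(len(combined))
```

The ratio of fair to biased rows is a design choice that depends on how much unbiased text the model is expected to see in use.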

📄 Research Paper

GUS-Net: Social Bias Classification in Text with Generalizations,... (arXiv.org)

📊 Dataset Details (from the paper)

Chart caption: The GUS dataset is a random sample of the corpus (3739 rows), but this chart should also represent the distribution in the corpus.

Hugging Face datasets: ethical-spectacle/gus-dataset-v1 and ethical-spectacle/biased-corpus.