BEADs Dataset (2024)
Bias Evaluation Across Domains (The Vector Institute)
3.67M rows | 2024 | The Vector Institute
The BEADs corpus was compiled from the following datasets: MBIC, Hyperpartisan news, Toxic comment classification, Jigsaw Unintended Bias, Age Bias, Multi-dimensional news (Ukraine), and Social biases.
It was first annotated by humans, then labeled with semi-supervised learning, and finally verified by humans.
It's one of the largest and most up-to-date datasets for bias and toxicity classification, though it's currently private, so you'll need to request access through Hugging Face.
🤗Hugging Face Dataset (request access)
📑 Contents
While BEADs doesn't have a binary label for bias, the ternary labels of the label field (neutral, slightly biased, and highly biased) can be categorized into biased (1) or unbiased (0). Additionally, the toxicity field contains binary labels.
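As a minimal sketch of that categorization, the snippet below collapses the ternary label into a binary one. It assumes the label values are stored as the strings "neutral", "slightly biased", and "highly biased", and that both biased levels map to 1. Neither the exact strings nor the mapping is confirmed here, and the helper name `to_binary_bias` is hypothetical, so check the actual values in the dataset before relying on it.

```python
# Assumption: label values are these strings (case-insensitive), and both
# "slightly biased" and "highly biased" count as biased (1).
BIASED_LABELS = {"slightly biased", "highly biased"}
UNBIASED_LABELS = {"neutral"}


def to_binary_bias(label: str) -> int:
    """Map a ternary bias label to 1 (biased) or 0 (unbiased)."""
    normalized = label.strip().lower()
    if normalized in BIASED_LABELS:
        return 1
    if normalized in UNBIASED_LABELS:
        return 0
    # Fail loudly rather than silently mislabel an unknown value.
    raise ValueError(f"Unexpected label: {label!r}")


# Illustrative rows only; these are not taken from the dataset.
rows = [
    {"text": "first example sentence", "label": "Neutral"},
    {"text": "second example sentence", "label": "Slightly Biased"},
    {"text": "third example sentence", "label": "Highly Biased"},
]
binary_labels = [to_binary_bias(row["label"]) for row in rows]
```

Raising on unknown values is a deliberate choice: if the real label strings differ from the assumed ones, the mismatch surfaces immediately instead of producing a skewed binary split.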
📄 Research Paper
Fields | Description |
---|---|
text | The sentence or sentence fragment. |
dimension | Descriptive category of the text. |
biased_words | A compilation of words regarded as biased. |
aspect | Specific sub-topic within the main content. |
label | Indicates the degree of bias. The label is ternary: highly biased, slightly biased, or neutral. |
toxicity | Indicates the presence (True) or absence (False) of toxicity. |
identity_mention | Mention of any identity, based on word matching. |