BEADs Dataset (2024)

Bias Evaluation Across Domains (The Vector Institute)

3.67M rows | 2024 | The Vector Institute

The BEADs corpus was compiled from the following datasets: MBIC, Hyperpartisan news, Toxic comment classification, Jigsaw Unintended Bias, Age Bias, Multi-dimensional news (Ukraine), and Social biases.

It was first annotated by humans, then labeled with semi-supervised learning, and finally verified by humans.

It's one of the largest and most recent datasets for bias and toxicity classification. It's currently private, so you'll need to request access through Hugging Face.

🤗 Hugging Face Dataset (request access)

📑 Contents

Field | Description
text | The sentence or sentence fragment.
dimension | Descriptive category of the text.
biased_words | A list of the words in the text regarded as biased.
aspect | Specific sub-topic within the main content.
label | The bias label. It is ternary: highly biased, slightly biased, or neutral.
toxicity | Indicates the presence (True) or absence (False) of toxicity.
identity_mention | Any identity mentioned in the text, detected by word matching.

While BEADs doesn't have a binary label for bias, the ternary values of the label field (neutral, slightly biased, and highly biased) can be collapsed into biased (1) or unbiased (0). Additionally, the toxicity field contains binary labels.
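The collapse from ternary to binary labels can be sketched as below. This is a minimal illustration, not part of the dataset's tooling: the helper name `to_binary_bias` is hypothetical, and the exact label strings are an assumption, so check the values stored in the `label` field once you have access.

```python
# Assumed label strings (compared case-insensitively); verify against the real data.
BIASED_LABELS = {"slightly biased", "highly biased"}
NEUTRAL_LABEL = "neutral"


def to_binary_bias(label: str) -> int:
    """Map a ternary bias label to 1 (biased) or 0 (unbiased)."""
    normalized = label.strip().lower()
    if normalized in BIASED_LABELS:
        return 1
    if normalized == NEUTRAL_LABEL:
        return 0
    # Fail loudly on anything unexpected rather than guessing a class.
    raise ValueError(f"Unexpected label: {label!r}")


# Hypothetical rows shaped like the fields described above.
rows = [
    {"text": "example sentence A", "label": "Neutral"},
    {"text": "example sentence B", "label": "Slightly Biased"},
    {"text": "example sentence C", "label": "Highly Biased"},
]
binary_labels = [to_binary_bias(row["label"]) for row in rows]
```

Raising on unknown values is a deliberate choice here: a silent default would hide label strings that differ from the assumed ones.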

📄 Research Paper
