Image + Text Pair Classification
Classifying the use of an image with a text sequence as biased/fair.
Though images are often used (for better or for worse) in articles and social media posts, incorporating them into the bias or fake-news classification pipeline is still relatively unexplored in research.
Image/text pair classification relies on the same kind of text embeddings used in sequence classification and NER (created by a text encoder like BERT). This time, we also process images with an image encoder, then fuse the text and image encodings together for classification tasks such as binary bias classification.
90k rows | 2024
The dataset includes around 90,000 news articles, curated from a broad spectrum of reliable sources, including major news outlets from around the globe, from May 2023 to September 2024. These articles were gathered through open data sources using Google RSS, adhering to research ethics guidelines.
NMB+ has images and multimodal labels for the text + image pair of each news article.
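As a minimal sketch, the dataset can be loaded with the Hugging Face datasets library. The dataset ID below is a placeholder (substitute the one from the NMB+ dataset card), and the column names follow the field table at the end of this page.

```python
# Minimal sketch: load NMB+ and inspect one text/image pair.
# "org/nmb-plus" is a placeholder ID; use the actual ID from the dataset card.
from datasets import load_dataset

dataset = load_dataset("org/nmb-plus", split="train")

example = dataset[0]
print(example["headline"])           # text side of the pair
print(example["image"])              # image side of the pair
print(example["multimodal_label"])   # 'Likely' / 'Unlikely' annotation
```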
Fine-tune Llama 3.2 Vision Instruct with QLoRA for image/text classification: 💻Notebook
Train your own VLM for bias detection: 💻(4) Notebooks
BERT (or another text encoder model) processes a text sequence into an encoding sequence, where self-attention heads fold the meaning of surrounding context words into each token representation.
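Here is a minimal sketch of producing such an encoding with Hugging Face transformers. The bert-base-uncased checkpoint and the use of the [CLS] token as a pooled sequence representation are illustrative choices, not a requirement of the method.

```python
# Minimal sketch: encode a news text with BERT and pool a single vector from it.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")

text = "Headline of the news article. First paragraph of the story..."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = text_encoder(**inputs)

# Each token gets a contextual 768-d vector; the [CLS] token is often used
# as a single fixed-size representation of the whole sequence.
token_embeddings = outputs.last_hidden_state   # shape: (1, seq_len, 768)
text_embedding = token_embeddings[:, 0, :]     # shape: (1, 768)
```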
ResNet (or other image encoder models) processes an image into a convolutional representation.
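A corresponding sketch for the image side, assuming torchvision's ResNet-50 with its ImageNet classification layer dropped so that the pooled 2048-dimensional features are returned. The image path is hypothetical.

```python
# Minimal sketch: turn the article's top image into a feature vector with ResNet-50.
import torch
from torchvision import models, transforms
from PIL import Image

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()   # drop the classifier, keep the pooled features
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("path/to/top_image.jpg").convert("RGB")   # hypothetical path
with torch.no_grad():
    image_embedding = resnet(preprocess(image).unsqueeze(0))  # shape: (1, 2048)
```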
We combine/pool the text and image representations into one set of features that we can classify. There are many techniques (sketched in code after this list), such as:
Concatenation: Joining the representations end to end, one after the other.
Dot product alignment: Using the dot product of the two representations as the fused representation.
Fusion layer: One or more linear layers that process the representations before classification.
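The snippet below sketches all three options. The 768/2048 sizes match the BERT and ResNet-50 sketches above, and the 512-dimensional projection size is an arbitrary illustrative choice.

```python
# Minimal sketch of the three fusion strategies above.
import torch
import torch.nn as nn

text_embedding = torch.randn(1, 768)     # stand-in for the BERT [CLS] embedding
image_embedding = torch.randn(1, 2048)   # stand-in for the ResNet-50 features

# 1. Concatenation: join the two vectors end to end -> (1, 768 + 2048)
fused_concat = torch.cat([text_embedding, image_embedding], dim=-1)

# 2. Dot product alignment: project both into a shared space, then take their dot product
text_proj = nn.Linear(768, 512)
image_proj = nn.Linear(2048, 512)
alignment_score = (text_proj(text_embedding) * image_proj(image_embedding)).sum(dim=-1)

# 3. Fusion layer: pass the concatenated features through one or more linear layers
fusion = nn.Sequential(nn.Linear(768 + 2048, 512), nn.ReLU())
fused_features = fusion(fused_concat)    # shape: (1, 512)
```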
The fused embeddings are passed into a classification head, which produces an output logit that is activated (typically with a sigmoid or softmax function) to give a probability between 0 and 1.
A threshold is sometimes applied to the output (e.g. probability > 0.5 is "Biased").
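Continuing the sketch, a one-logit head with a sigmoid activation and a 0.5 decision threshold (the feature size and the threshold are both illustrative) looks like this:

```python
# Minimal sketch: classification head, sigmoid activation, and thresholding.
import torch
import torch.nn as nn

fused_features = torch.randn(1, 512)   # stand-in for the fused features above

classifier = nn.Linear(512, 1)          # single output logit for biased vs. neutral
logit = classifier(fused_features)
probability = torch.sigmoid(logit)      # falls between 0 and 1
label = "Biased" if probability.item() > 0.5 else "Neutral"
```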
Metrics:
When evaluating a model's performance at binary classification, you should understand how positive (biased) and negative (neutral) examples fall into the categories of correct (true) and incorrect (false) predictions.
Your individual requirements will guide your interpretation (e.g. maybe you REALLY want to avoid false positives).
Confusion Matrix: Used to visualize the counts of correct and incorrect classifications; the goal is to have as many predictions as possible on the diagonal (true positives and true negatives) and as few as possible off it (false positives and false negatives).
Precision: Of everything the model predicted as biased, the fraction that is actually biased: TP / (TP + FP).
Recall: Of everything that is actually biased, the fraction the model catches: TP / (TP + FN).
F1 Score: The harmonic mean of precision and recall: 2 · (precision · recall) / (precision + recall).
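These can all be computed with scikit-learn. The labels below are made up purely to show the calls (1 = biased, 0 = neutral).

```python
# Minimal sketch: confusion matrix, precision, recall, and F1 with scikit-learn.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))
```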
| Field | Description |
| --- | --- |
| unique_id | Unique identifier for each news item. Each unique_id is associated with the image (top image) for the same news article. |
| outlet | Publisher of the news article. |
| headline | Headline of the news article. |
| article_text | Full text content of the news article. |
| image_description | Description of the image paired with the article. |
| image | File path of the image associated with the article. |
| date_published | Publication date of the news article. |
| source_url | Original URL of the news article. |
| canonical_link | Canonical URL of the news article, if different from the source URL. |
| new_categories | Categories assigned to the article. |
| news_categories_confidence_scores | Confidence scores for the assigned categories. |
| text_label | Annotation for the textual content, indicating 'Likely' or 'Unlikely' to be disinformation. |
| multimodal_label | Annotation for the combined text snippet (first paragraph of the news story) and image content, assessing 'Likely' or 'Unlikely' to be disinformation. |
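As a small usage sketch, the 'Likely'/'Unlikely' annotations can be mapped to binary training targets. The dataset ID is again a placeholder, and the column names follow the table above.

```python
# Minimal sketch: convert the 'Likely' / 'Unlikely' annotations into 0/1 targets.
from datasets import load_dataset

dataset = load_dataset("org/nmb-plus", split="train")   # placeholder dataset ID
label_map = {"Likely": 1, "Unlikely": 0}

dataset = dataset.map(
    lambda row: {
        "text_target": label_map[row["text_label"]],
        "multimodal_target": label_map[row["multimodal_label"]],
    }
)
```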