
Image + Text Pair Classification

Classifying the use of an image with a text sequence as biased/fair.


Overview of Task

Though images are often used (for better or for worse) in articles and social media posts, incorporating them into the bias or fake-news classification pipeline is still relatively unexplored in research.

Image/text pair classification relies on the same type of text embeddings used in sequence classification and NER (created by a text encoder like BERT). This time, we also process images with an image encoder, then fuse the text and image encodings together for classification tasks such as binary classification.
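For intuition, here's a minimal sketch of what that fusion looks like (the 768- and 512-dimensional embedding sizes are assumptions matching BERT-base and ResNet34; random tensors stand in for real encoder outputs):

```python
import torch

batch_size = 4
text_emb = torch.randn(batch_size, 768)   # e.g., BERT-base [CLS] embeddings
image_emb = torch.randn(batch_size, 512)  # e.g., ResNet34 pooled features

# Fuse by concatenation: one joint feature vector per text/image pair
fused = torch.cat([text_emb, image_emb], dim=1)  # shape: (4, 1280)

# A classification head maps the fused features to a single bias logit
classifier = torch.nn.Linear(768 + 512, 1)
logits = classifier(fused)                        # shape: (4, 1)
probs = torch.sigmoid(logits)                     # probabilities in (0, 1)
```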


🤖 Models:

TruBIAS

Coming soon ;)

BERT + ResNet34 (guided by CLIP)

To perform the encoding required for both text and images, we can use pre-trained encoders for each, then use a classification head to classify the combined outputs (fused by concatenation, alignment, contrast, etc.). The same text encoders from other tasks will work (usually BERT-based), and ResNet34 is common in the literature (though other image encoders will also work). They perform feature extraction on both the text and the images, which is then fed into a classification head.

An interesting approach is to also use a framework that has been trained on both text AND images, like CLIP, whose text and image representations live in the same embedding space. We can process the text/image pairs with BERT and ResNet34, alongside CLIP, to calculate a contrastive loss (how different the encodings are). When combined with the classification loss of the output head, this can guide the two specialized encoders toward a shared embedding space during fine-tuning.

Aligning (or combining) the embeddings can be very sensitive. There are many methods, and you're free to create your own, but concatenation tends to be the most reliable. Keep in mind that the raw embeddings will be on different scales (normalization helps).
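Below is a rough sketch of both ideas, normalize-then-concatenate fusion and CLIP-guided alignment (the loss weighting `alpha`, the cosine-similarity formulation, and the omission of the projection layers that would match embedding dimensions are all simplifying assumptions, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def fuse(text_emb, image_emb):
    # Normalize first: BERT and ResNet embeddings live on different scales
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    return torch.cat([text_emb, image_emb], dim=-1)

def clip_guided_loss(text_emb, image_emb, clip_text_emb, clip_image_emb,
                     logits, labels, alpha=0.5):
    # labels: float tensor of 0./1. with the same shape as logits
    cls_loss = F.binary_cross_entropy_with_logits(logits, labels)
    # Guidance term: pull each specialized encoder's output toward CLIP's
    # (already aligned) embedding of the same input. Projection layers that
    # would match embedding dimensions are omitted here for brevity.
    text_align = 1 - F.cosine_similarity(text_emb, clip_text_emb).mean()
    image_align = 1 - F.cosine_similarity(image_emb, clip_image_emb).mean()
    return cls_loss + alpha * (text_align + image_align)
```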

📄 Research Paper: Multimodal Fake News Detection via CLIP-Guided Learning (arXiv.org)

💻 Notebook to Train Your Own: Coming soon (still lookin')

💾 Datasets:

News Media Bias Plus (NMB+) Dataset

90k rows | 2024

NMB+ has images, and multimodal labels for the text + image pair of each news article. The dataset includes around 90,000 news articles, curated from a broad spectrum of reliable sources, including major news outlets from around the globe, from May 2023 to September 2024. These articles were gathered through open data sources using Google RSS, adhering to research ethics guidelines.

📑 Contents

| Field | Description |
| --- | --- |
| unique_id | Unique identifier for each news item. Each unique_id is associated with the image (top image) for the same news article. |
| outlet | Publisher of the news article. |
| headline | Headline of the news article. |
| article_text | Full text content of the news article. |
| image_description | Description of the image paired with the article. |
| image | File path of the image associated with the article. |
| date_published | Publication date of the news article. |
| source_url | Original URL of the news article. |
| canonical_link | Canonical URL of the news article, if different from the source URL. |
| new_categories | Categories assigned to the article. |
| news_categories_confidence_scores | Confidence scores for the assigned categories. |
| text_label | Annotation for the textual content: 'Likely' or 'Unlikely' to be disinformation. |
| multimodal_label | Annotation for the combined text snippet (first paragraph of the news story) and image content: 'Likely' or 'Unlikely' to be disinformation. |

🤗 HuggingFace Dataset: vector-institute/newsmediabias-plus (request access)

Website (Official Docs): News Media Bias Plus

📰 Blog Post: New multimodal dataset will help in the development of ethical AI systems (Vector Institute for Artificial Intelligence)
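Once your access request is approved, the dataset can be loaded with the Hugging Face datasets library (a minimal sketch; the train split name and authentication step assume the usual gated-dataset workflow):

```python
from datasets import load_dataset

# Gated dataset: run `huggingface-cli login` first with an account
# that has been granted access.
ds = load_dataset("vector-institute/newsmediabias-plus", split="train")

example = ds[0]
print(example["headline"])
print(example["text_label"], example["multimodal_label"])
```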


How it Works:

  1. BERT (or another text encoder) processes a text sequence into a sequence of token encodings, where self-attention heads encode each token's surrounding context into its representation.

  2. ResNet (or another image encoder) processes an image into a convolutional feature representation.

  3. We combine/pool the text and image representations into one set of features that we can classify. There are many techniques (see the sketch after this list), such as:

    1. Concatenation: Appending the representations one after another.

    2. Dot product alignment: Using the dot product of the two representations as the fused representation.

    3. Fusion layer: One or more linear layers that process the representations before classification.

  4. The fused embedding is passed into a classification head, whose output logit is activated (typically with a sigmoid or softmax function) to produce a probability between 0 and 1.

  5. A threshold is sometimes applied to the output (e.g. probability > 0.5 is "Biased").
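Putting the five steps together, here's a minimal PyTorch sketch of the concatenation variant (the model choices, pooling strategy, and hidden sizes are illustrative, not a reference implementation):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from torchvision import models

class BiasFusionClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # 1. Text encoder: BERT produces contextual token embeddings
        self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")
        # 2. Image encoder: ResNet34 with its ImageNet head removed
        self.image_encoder = models.resnet34(weights="IMAGENET1K_V1")
        self.image_encoder.fc = nn.Identity()  # -> 512-dim pooled features
        # 3./4. Fusion + classification head with a single output logit
        self.classifier = nn.Sequential(
            nn.Linear(768 + 512, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        # Use the [CLS] token embedding as the sequence representation
        text_out = self.text_encoder(input_ids=input_ids,
                                     attention_mask=attention_mask)
        text_emb = text_out.last_hidden_state[:, 0]      # (B, 768)
        image_emb = self.image_encoder(pixel_values)     # (B, 512)
        fused = torch.cat([text_emb, image_emb], dim=1)  # concatenation
        return self.classifier(fused)                    # (B, 1) logit

# 5. Sigmoid + threshold turns the logit into a Biased/Fair decision
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BiasFusionClassifier().eval()
enc = tokenizer(["A headline to check"], return_tensors="pt")
image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
with torch.no_grad():
    prob = torch.sigmoid(model(enc["input_ids"], enc["attention_mask"], image))
print("Biased" if prob.item() > 0.5 else "Fair")
```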

Metrics:

When evaluating a model's performance at binary classification, you should understand how positive (biased) and negative (neutral) examples fall into the outcome categories: correct (true) predictions and incorrect (false) predictions of each class.

Your individual requirements will guide your interpretation (e.g. maybe you REALLY want to avoid false positives).

  • Confusion Matrix: Used to visualize the counts of correct and incorrect classifications; the goal is to concentrate predictions on the diagonal (true positives and true negatives).

  • Precision: $\frac{TP}{TP + FP}$

  • Recall: $\frac{TP}{TP + FN}$

  • F1 Score: $2 \times \frac{precision \times recall}{precision + recall}$
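A quick scikit-learn sketch of these metrics, using made-up predictions and the 0.5 threshold from above (1 = biased, 0 = neutral):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]           # 1 = biased, 0 = neutral
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]
y_pred = [int(p > 0.5) for p in y_prob]     # apply the 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))
```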



Fine-tune Llama 3.2 Vision Instruct (QLoRA) for image/text classification: 💻 Notebook

Train your own VLM for bias detection: 💻 (4) Notebooks
