---
title: "LayoutLM: A Powerful Model for Document Image Understanding"
date: 2024-12-21
slug: layoutlm
authors: [kevinpeng]
categories:
  - AI Assistants
tags:
  - layoutlm
  - document processing
description: In-depth analysis of the LayoutLM document image understanding model. Microsoft's open-source multimodal AI model combines text and layout information, making it well suited to form processing, invoice recognition, and more.
---

LayoutLM: A Powerful Model for Document Image Understanding

1. Brief Introduction
In the digital age, we encounter countless documents daily: scans, forms, receipts, and more. Teaching computers to understand documents that contain both text and layout information has been a key research focus in AI. Traditional NLP models focus mainly on text content and ignore document layout and visual information, which creates a bottleneck when processing document images. To address this, Microsoft released the LayoutLM model in 2020.
- Background:
  - Before LayoutLM, NLP models mainly took text as input, while computer vision models took images as input.
  - LayoutLM was the first model to combine text, image, and 2D position information in a single input, enabling multimodal document processing.
- Development Team: LayoutLM was jointly developed by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou.
- Functionality:
  - LayoutLM is designed to understand document images, enabling tasks such as information extraction, form understanding, receipt understanding, and document classification.
  - It significantly improves document image understanding by jointly modeling the interactions between text and layout information.
  - LayoutLM can extract specific, targeted information from scanned documents or images.
2. Architecture Design
LayoutLM's architecture is based on BERT (Bidirectional Encoder Representations from Transformers). It adds two new input embeddings to BERT:
- 2D Position Embeddings: Used to represent the spatial position of text in documents. Unlike traditional position embeddings that only consider word order, 2D position embeddings use bounding box coordinates (x0, y0, x1, y1) for each word to define its position on the page. The top-left corner of the document is treated as the origin (0, 0) of the coordinate system. These coordinates are normalized to a 0-1000 range and then embedded into numerical representations the model can understand.
- Image Embeddings: Used to integrate visual information. LayoutLM segments images into regions corresponding to OCR text and uses visual features from these regions to generate image embeddings. Image embeddings help the model understand the document's visual style, enhancing document understanding.
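To make the embedding combination concrete, here is a toy, framework-free sketch. All table sizes, values, and the helper name are made up for illustration; the real model uses learned embedding tables and also adds 1D position and segment embeddings before the transformer layers.

```python
# Toy sketch: each token's final input vector is its token embedding plus four
# 2D-position embeddings, looked up with the normalized box (x0, y0, x1, y1)
# and summed element-wise. Sizes and values here are invented for illustration.

def combine_embeddings(token_vec, x_table, y_table, box):
    x0, y0, x1, y1 = box
    parts = [token_vec, x_table[x0], y_table[y0], x_table[x1], y_table[y1]]
    return [sum(vals) for vals in zip(*parts)]

# Tiny illustrative lookup tables: normalized coordinate -> 2-dim vector
x_table = {10: [0.0, 1.0], 50: [0.5, 0.0]}
y_table = {20: [1.0, 0.0], 80: [0.0, 0.5]}
token_vec = [0.1, 0.2]  # embedding of one word-piece token

vec = combine_embeddings(token_vec, x_table, y_table, box=[10, 20, 50, 80])
```

Because the lookups share the same x table for x0/x1 and the same y table for y0/y1, the model learns a single notion of horizontal and vertical position, which is how the real 2D position embeddings are organized.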
Pre-training:
- LayoutLM uses Masked Visual-Language Model (MVLM) for pre-training. MVLM is a technique inspired by masked language models, but it considers both text and 2D position embeddings as input. The model learns to predict masked words using contextual text and spatial position information.
- LayoutLM also uses Multi-label Document Classification (MDC) as a pre-training task. MDC trains LayoutLM on scanned documents that carry multiple labels, letting it aggregate knowledge from multiple domains and produce better document-level representations. It is optional, however, because it requires document-level labels, which are not always available for large pre-training corpora.
- LayoutLM's pre-training used the IIT-CDIP Test Collection 1.0 dataset, containing over 6 million documents and 11 million scanned document images.
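The MVLM idea can be sketched in plain Python. The helper below is hypothetical and only illustrates the key point: the text token is hidden, but its 2D bounding box stays visible, so the model can exploit spatial context when predicting the masked word.

```python
import random

def mask_for_mvlm(tokens, boxes, mask_prob=0.15, seed=0):
    """MVLM-style masking sketch (hypothetical helper): replace a fraction of
    text tokens with [MASK] while every token keeps its bounding box."""
    rng = random.Random(seed)
    masked_tokens, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked_tokens.append("[MASK]")
            targets.append(tok)       # the model must predict this word
        else:
            masked_tokens.append(tok)
            targets.append(None)      # not part of the MVLM loss
    return masked_tokens, boxes, targets  # boxes pass through unchanged

tokens = ["Total", "Amount", "Due", ":", "$42.00"]
boxes = [[70, 850, 150, 880]] * 5  # toy normalized boxes
masked, kept_boxes, targets = mask_for_mvlm(tokens, boxes, mask_prob=0.5)
```

Note that `kept_boxes` is identical to the input `boxes`: only the text channel is corrupted, never the layout channel.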
3. Document Types It Can Handle
LayoutLM excels at handling documents where layout and visual information are crucial for understanding content. These include:
- Forms: LayoutLM achieves excellent results on form understanding tasks, accurately handling structured documents with specific fields and layouts. The FUNSD dataset is commonly used to train and evaluate LayoutLM's form understanding capabilities.
- Receipts: LayoutLM also performs well on receipt understanding tasks. It can extract data from receipts using both text and layout information. The SROIE dataset is used to fine-tune LayoutLM for receipt data.
- Scanned Documents: LayoutLM effectively handles scanned documents, simultaneously modeling interactions between text and layout information.
- Business Documents: LayoutLM can be applied to a wide range of business documents, including:
  - Purchase orders
  - Financial reports
  - Business emails
  - Sales agreements
  - Vendor contracts
  - Letters
  - Invoices
  - Resumes
- Other Visually Rich Documents: LayoutLM is suitable for any visually rich documents where layout significantly enhances language representation.
4. Usage Tips
- OCR Engine: Use an OCR (Optical Character Recognition) engine (e.g., Tesseract) to extract text and corresponding bounding boxes from document images.
- Bounding Box Normalization: Before passing bounding box coordinates to LayoutLM, normalize them to the 0-1000 range: divide each x coordinate by the image width, divide each y coordinate by the image height, then multiply by 1000.
- Special Tokens: LayoutLM uses special tokens to handle text, including:
  - [CLS]: Classification token; the first token of the sequence, used for sequence classification.
  - [SEP]: Separator token, used to separate multiple sequences.
  - [PAD]: Padding token, used to pad sequences to a common length.
  - [MASK]: Mask token, used for masked language modeling.
  - [UNK]: Unknown token, used to represent words not in the vocabulary.
- Choose the Right Tokenizer: Use LayoutLMTokenizer or LayoutLMTokenizerFast for tokenization. LayoutLMTokenizerFast is a faster version based on Hugging Face's tokenizers library.
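The OCR and normalization tips above can be sketched as follows. The `normalize_box` helper is ours, not part of any library; the `pytesseract` call mentioned in the comment is one common way to obtain raw pixel boxes.

```python
def normalize_box(box, width, height):
    """Scale pixel coordinates (x0, y0, x1, y1) to LayoutLM's 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

# In practice the raw boxes would come from an OCR engine, e.g.
# pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT),
# which returns word texts plus left/top/width/height in pixels.
box = [250, 100, 500, 150]  # pixel coordinates from OCR
print(normalize_box(box, width=1000, height=800))  # -> [250, 125, 500, 187]
```

Because the output range is fixed at 0-1000 regardless of the source image size, documents of any resolution map into the same coordinate space the model was pre-trained on.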
5. Environment Requirements
LayoutLM's environment requirements include:
- Programming Language and Framework: LayoutLM can be implemented and trained with PyTorch or TensorFlow.
  - PyTorch is an open-source machine learning library commonly used to implement neural networks and deep learning models.
  - TensorFlow is another popular open-source machine learning library used for the same purposes.
- Hugging Face Transformers Library: the core library for using LayoutLM, providing pre-trained models, tokenizers, and related tools.
  - It offers several LayoutLM model classes for different tasks, including LayoutLMModel, LayoutLMForMaskedLM, LayoutLMForSequenceClassification, LayoutLMForTokenClassification, and LayoutLMForQuestionAnswering.
- OCR Engine: an OCR engine is required to extract text and the corresponding bounding boxes from document images.
  - A common choice is Tesseract.
  - The OCR engine converts the text in images to machine-readable text and provides the coordinates needed for the 2D position embeddings.
- Image Processing Library: an image processing library such as Pillow (PIL) is needed to handle document images.
- Data Processing Libraries: data processing libraries such as NumPy and pandas are needed.
- Hardware Requirements: for model training, a GPU significantly accelerates training.
- Python Environment: a Python environment with the necessary libraries installed.
- Tokenizer: LayoutLMTokenizer or LayoutLMTokenizerFast for tokenization; the Fast version is based on Hugging Face's tokenizers library.
  - The tokenizer splits text into tokens the model can understand.
- Datasets: different tasks require different datasets, e.g. the FUNSD dataset for form understanding, SROIE for receipt understanding, and RVL-CDIP for document image classification.
In summary, using LayoutLM requires a Python environment configured with appropriate libraries (Transformers, PyTorch or TensorFlow, OCR engine) and a platform capable of data preprocessing and model training.
6. Code Example
Here's a PyTorch code example using LayoutLM for sequence classification:
```python
import pandas as pd
import torch
from tqdm.auto import tqdm
from sklearn.model_selection import train_test_split
from datasets import Dataset, Features, Sequence, ClassLabel, Value, Array2D
from transformers import LayoutLMTokenizer, LayoutLMForSequenceClassification

# Load the dataset.
# Assuming you have a dataframe named 'df' with columns 'words', 'bbox', 'label',
# where the bounding box coordinates are already normalized to the 0-1000 range.

# Create a label-to-index mapping
labels = df['label'].unique().tolist()
label2idx = {label: idx for idx, label in enumerate(labels)}

# Load the tokenizer
tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")

# Encode a single training example
def encode_training_example(example, max_seq_length=512, pad_token_box=[0, 0, 0, 0]):
    words = example['words']
    normalized_word_boxes = example['bbox']
    assert len(words) == len(normalized_word_boxes)

    # Repeat each word's box once per word-piece token
    token_boxes = []
    for word, box in zip(words, normalized_word_boxes):
        word_tokens = tokenizer.tokenize(word)
        token_boxes.extend([box] * len(word_tokens))

    # Truncate, then add boxes for the [CLS] and [SEP] special tokens
    special_tokens_count = 2
    if len(token_boxes) > max_seq_length - special_tokens_count:
        token_boxes = token_boxes[: (max_seq_length - special_tokens_count)]
    token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

    encoding = tokenizer(' '.join(words), padding='max_length', truncation=True)
    # Pad the boxes to the same length as the padded input_ids
    input_ids = tokenizer(' '.join(words), truncation=True)["input_ids"]
    padding_length = max_seq_length - len(input_ids)
    token_boxes += [pad_token_box] * padding_length
    encoding['bbox'] = token_boxes
    encoding['label'] = label2idx[example['label']]

    assert len(encoding['input_ids']) == max_seq_length
    assert len(encoding['attention_mask']) == max_seq_length
    assert len(encoding['token_type_ids']) == max_seq_length
    assert len(encoding['bbox']) == max_seq_length
    return encoding

# Prepare a DataLoader from a dataframe
def training_dataloader_from_df(data_df):
    dataset = Dataset.from_pandas(data_df)
    # Features describe the *encoded* dataset produced by map()
    features = Features({
        'input_ids': Sequence(Value('int64')),
        'attention_mask': Sequence(Value('int64')),
        'token_type_ids': Sequence(Value('int64')),
        'bbox': Array2D(dtype='int64', shape=(512, 4)),
        'label': ClassLabel(names=labels),
    })
    encoded_dataset = dataset.map(
        encode_training_example, features=features,
        remove_columns=dataset.column_names,
    )
    encoded_dataset.set_format(
        type='torch',
        columns=['input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'label'],
    )
    return torch.utils.data.DataLoader(encoded_dataset, batch_size=4, shuffle=True)

# Split into train and validation sets
train_data, valid_data = train_test_split(df, test_size=0.2, random_state=42)
train_dataloader = training_dataloader_from_df(train_data)
valid_dataloader = training_dataloader_from_df(valid_data)

# Define the device to train on
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model
model = LayoutLMForSequenceClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=len(label2idx)
)
model.to(device)

# Define the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5)

# Training loop
num_epochs = 3
for epoch in range(num_epochs):
    print("Epoch:", epoch)
    training_loss = 0.0
    training_correct = 0
    model.train()
    for batch in tqdm(train_dataloader):
        batch_labels = batch["label"].to(device)
        outputs = model(
            input_ids=batch["input_ids"].to(device),
            bbox=batch["bbox"].to(device),
            attention_mask=batch["attention_mask"].to(device),
            token_type_ids=batch["token_type_ids"].to(device),
            labels=batch_labels,
        )
        loss = outputs.loss
        training_loss += loss.item()
        predictions = outputs.logits.argmax(-1)
        training_correct += (predictions == batch_labels).float().sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print("Training Loss:", training_loss / len(train_dataloader))
    training_accuracy = 100 * training_correct / len(train_data)
    print("Training accuracy:", training_accuracy.item())

    validation_loss = 0.0
    validation_correct = 0
    model.eval()
    with torch.no_grad():
        for batch in tqdm(valid_dataloader):
            batch_labels = batch["label"].to(device)
            outputs = model(
                input_ids=batch["input_ids"].to(device),
                bbox=batch["bbox"].to(device),
                attention_mask=batch["attention_mask"].to(device),
                token_type_ids=batch["token_type_ids"].to(device),
                labels=batch_labels,
            )
            loss = outputs.loss
            validation_loss += loss.item()
            predictions = outputs.logits.argmax(-1)
            validation_correct += (predictions == batch_labels).float().sum()
    print("Validation Loss:", validation_loss / len(valid_dataloader))
    validation_accuracy = 100 * validation_correct / len(valid_data)
    print("Validation accuracy:", validation_accuracy.item())
```
This example code demonstrates how to use LayoutLM for document classification. Note that input data must include text content (words), corresponding bounding box coordinates (bbox), and category labels (label), with bounding box coordinates normalized to the 0-1000 range.
7. FAQ
- What's the difference between LayoutLM and BERT?
  - BERT processes only text, while LayoutLM processes text, layout, and visual information together.
  - LayoutLM integrates layout and visual information through 2D position embeddings and image embeddings, enabling better understanding of document images.
- How does LayoutLM handle document images of different sizes?
  - Because bounding box coordinates are normalized to a fixed 0-1000 range, LayoutLM can handle document images of any size.
- Can LayoutLM handle Chinese documents?
  - The original LayoutLM checkpoints were pre-trained on English documents. For Chinese and other languages, use a multilingual variant such as LayoutXLM together with an appropriate tokenizer and pre-trained model.
- How do I choose the right pre-trained model?
  - The Hugging Face Transformers library offers several pre-trained LayoutLM models; choose based on your task and data.
- How can I improve LayoutLM performance?
  - Use high-quality OCR results.
  - Use fine-tuning data relevant to your task.
  - Tune hyperparameters such as the learning rate and number of training epochs.