
Vision-Language-Action (VLA) models represent a breakthrough in embodied AI, combining visual perception, language understanding, and robotic control into unified models that can follow natural language instructions to perform physical tasks. This tutorial covers the complete process of fine-tuning VLA models for your specific robotic applications.
VLAs combine computer vision (interpreting images/videos), natural language processing (understanding and generating text), and action execution (interacting with environments or systems). This allows them to perceive, reason, and act based on both visual and textual inputs.
If you want to see VLAs in action, check out the Google DeepMind Robotics Lab Tour with Hannah Fry.
Popular VLA architectures include OpenVLA, RT-2 (Robotic Transformer 2), and PaLM-E.
Set up a DigitalOcean GPU Droplet, then SSH into it by running ssh root@[IPV4] in your terminal.
In Terminal:
# Core dependencies
pip install torch torchvision
pip install transformers accelerate
pip install datasets
pip install wandb # for experiment tracking
pip install jupyterlab
jupyter lab
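Before moving on, it is worth confirming that PyTorch can see the Droplet's GPU. A quick check like the following (run in a Python shell or a JupyterLab cell) should print True followed by the GPU name:
import torch

print(torch.cuda.is_available())       # expect True on a GPU Droplet
print(torch.cuda.get_device_name(0))   # prints the attached GPU model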
Key components of a VLA:
- Vision Encoder: processes camera images into visual embeddings
- Language Encoder: converts text instructions into language embeddings
- Fusion Module: combines the visual and language information
- Action Decoder: predicts robot actions from the fused representation
A minimal sketch of how these pieces fit together follows.
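The sketch below is illustrative only; it is not how OpenVLA or RT-2 are implemented, and the module names, hidden size, and action dimension are placeholders. It simply shows the data flow from images and text to an action:
import torch
from torch import nn

class TinyVLA(nn.Module):
    """Illustrative only: how vision, language, fusion, and action modules connect."""
    def __init__(self, vision_encoder, language_encoder, hidden_dim=768, action_dim=7):
        super().__init__()
        self.vision_encoder = vision_encoder      # images -> visual token embeddings
        self.language_encoder = language_encoder  # token ids -> language embeddings
        self.fusion = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.action_decoder = nn.Linear(hidden_dim, action_dim)

    def forward(self, pixel_values, input_ids):
        visual_tokens = self.vision_encoder(pixel_values)   # (B, N_img, hidden_dim)
        text_tokens = self.language_encoder(input_ids)      # (B, N_txt, hidden_dim)
        # Cross-attention: language queries attend to visual tokens
        fused, _ = self.fusion(text_tokens, visual_tokens, visual_tokens)
        # Decode an action from the final fused token
        return self.action_decoder(fused[:, -1])            # (B, action_dim)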
Your dataset should contain episodes with:
{
    "episode_0": {
        "images": [img_0, img_1, ..., img_T],    # shape: (T, H, W, 3), where T is the number of frames
        "language": "pick up the red block",
        "actions": [act_0, act_1, ..., act_T],   # shape: (T, action_dim)
        "success": True
    }
}
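If you do not yet have real robot data, a small synthetic episode in the same format lets you exercise the rest of the pipeline end to end. The shapes below (224x224 RGB frames, a 7-dimensional action) are placeholder assumptions; replace them with your own:
import numpy as np

T = 50  # number of frames in this dummy episode
dummy_episode = {
    "images": [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(T)],
    "language": "pick up the red block",
    "actions": [np.random.uniform(-1, 1, size=7).astype(np.float32) for _ in range(T)],
    "success": True,
}
your_episodes = [dummy_episode]  # stand-in for real, teleoperated demonstrations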
from datasets import Dataset, DatasetDict
import numpy as np

def create_vla_dataset(episodes):
    """Convert robot episodes to HuggingFace dataset format"""
    data = {
        "images": [],
        "language": [],
        "actions": [],
        "episode_id": []
    }
    for ep_id, episode in enumerate(episodes):
        for t in range(len(episode['images'])):
            data['images'].append(episode['images'][t])
            data['language'].append(episode['language'])
            data['actions'].append(episode['actions'][t])
            data['episode_id'].append(ep_id)
    return Dataset.from_dict(data)

# Split into train/val
dataset = create_vla_dataset(your_episodes)
dataset = dataset.train_test_split(test_size=0.1)
from transformers import AutoModel, AutoProcessor
import torch

# Load a pre-trained VLA (example using OpenVLA)
model_name = "openvla/openvla-7b"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True  # OpenVLA ships custom modeling code
)

# Freeze base model parameters (optional; get_peft_model freezes the base model automatically)
for param in model.parameters():
    param.requires_grad = False
Your robot’s action space likely differs from the pre-training data:
from torch import nn

class ActionHead(nn.Module):
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, action_dim)
        )

    def forward(self, x):
        return self.fc(x)

# Add a custom action head (kept in fp32, on the same device as the backbone)
model.action_head = ActionHead(
    hidden_dim=model.config.hidden_size,
    action_dim=7  # e.g., 6-DOF arm + gripper
).to(model.device)
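A quick sanity check with a dummy hidden state (assuming model.config.hidden_size matches the backbone's output width) confirms that the head maps one hidden vector per sample to one 7-dimensional action:
dummy_hidden = torch.randn(2, model.config.hidden_size, device=model.device)
with torch.no_grad():
    print(model.action_head(dummy_hidden).shape)  # expected: torch.Size([2, 7])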
Low-Rank Adaptation (LoRA) enables efficient fine-tuning by reducing the number of trainable parameters:
from peft import LoraConfig, get_peft_model

# Configure LoRA
lora_config = LoraConfig(
    r=16,  # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected output: trainable params: 8.3M || all params: 7B || trainable%: 0.12%
from torchvision import transforms

image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    # Caution: horizontal flips mirror the scene; only keep this if you also mirror the corresponding actions
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
def normalize_actions(actions, stats):
    """Standardize actions using the dataset mean and std"""
    return (actions - stats['mean']) / (stats['std'] + 1e-8)

def denormalize_actions(normalized_actions, stats):
    """Convert back to the original action space"""
    return normalized_actions * (stats['std'] + 1e-8) + stats['mean']

# Calculate statistics from your dataset
train_actions = np.array(dataset['train']['actions'])
action_stats = {
    'mean': train_actions.mean(axis=0),
    'std': train_actions.std(axis=0)
}
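The statistics are only useful if the actions the model trains on are actually normalized. One way to apply them, sketched here with datasets.Dataset.map and the column layout above, is:
def _normalize_example(example):
    example['actions'] = normalize_actions(np.array(example['actions']), action_stats).tolist()
    return example

dataset['train'] = dataset['train'].map(_normalize_example)
dataset['test'] = dataset['test'].map(_normalize_example)  # train_test_split names the held-out split "test"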
from torch.utils.data import DataLoader

def collate_fn(batch):
    """Custom collate function for VLA data"""
    images = torch.stack([
        processor.image_processor(np.asarray(b['images']), return_tensors="pt")['pixel_values'][0]
        for b in batch
    ])
    text = processor.tokenizer([b['language'] for b in batch],
                               padding=True, return_tensors="pt")
    actions = torch.tensor([b['actions'] for b in batch],
                           dtype=torch.float32)
    return {
        'pixel_values': images,
        'input_ids': text['input_ids'],
        'attention_mask': text['attention_mask'],
        'actions': actions
    }

train_loader = DataLoader(
    dataset['train'],
    batch_size=32,
    shuffle=True,
    collate_fn=collate_fn,
    num_workers=4
)
import torch.nn.functional as F

def vla_loss(predicted_actions, target_actions, reduction='mean'):
    """Action prediction loss (MSE for continuous actions)"""
    mse_loss = F.mse_loss(predicted_actions, target_actions, reduction=reduction)
    return mse_loss

# Alternative: action chunking for temporal coherence
def chunked_action_loss(pred_chunks, target_chunks, chunk_size=10):
    """Predict multiple future actions at once"""
    loss = 0
    for i in range(chunk_size):
        loss += F.mse_loss(pred_chunks[:, i], target_chunks[:, i])
    return loss / chunk_size
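As a quick illustration of the expected shapes, the chunked variant takes (batch, chunk_size, action_dim) tensors; the numbers below are arbitrary:
pred_chunks = torch.randn(8, 10, 7)    # (batch, chunk_size, action_dim)
target_chunks = torch.randn(8, 10, 7)
print(chunked_action_loss(pred_chunks, target_chunks, chunk_size=10))  # scalar tensor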
from accelerate import Accelerator
from tqdm import tqdm
import wandb

# Initialize accelerator for mixed-precision / distributed training
accelerator = Accelerator(mixed_precision='bf16')

# Setup
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10000)

# Prepare for distributed training
model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, train_loader
)

# Initialize wandb
wandb.init(project="vla-finetuning", config={
    "learning_rate": 1e-4,
    "batch_size": 32,
    "epochs": 10
})

# Training loop
global_step = 0
for epoch in range(10):
    model.train()
    epoch_loss = 0

    for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}"):
        # Forward pass
        outputs = model(
            pixel_values=batch['pixel_values'],
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask']
        )

        # Predict actions (cast the hidden state to fp32 to match the action head and targets)
        predicted_actions = model.action_head(outputs.last_hidden_state[:, -1].float())

        # Compute loss
        loss = vla_loss(predicted_actions, batch['actions'])

        # Backward pass
        accelerator.backward(loss)
        accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        # Logging
        epoch_loss += loss.item()
        global_step += 1
        if global_step % 100 == 0:
            wandb.log({
                "loss": loss.item(),
                "learning_rate": scheduler.get_last_lr()[0],
                "epoch": epoch
            })

    print(f"Epoch {epoch+1} - Average Loss: {epoch_loss / len(train_loader):.4f}")

    # Save a checkpoint every other epoch
    if (epoch + 1) % 2 == 0:
        accelerator.save_model(model, f"checkpoints/epoch_{epoch+1}")
def evaluate_in_simulation(model, env, num_episodes=50):
    """Evaluate the model in a simulation environment"""
    model.eval()
    success_count = 0

    with torch.no_grad():
        for episode in range(num_episodes):
            obs = env.reset()
            instruction = env.get_task_instruction()
            done = False

            while not done:
                # Process observation
                image = torch.tensor(obs['image']).unsqueeze(0)
                text_inputs = processor.tokenizer(instruction, return_tensors="pt")

                # Predict action
                outputs = model(
                    pixel_values=image.to(model.device),
                    input_ids=text_inputs['input_ids'].to(model.device)
                )
                action = model.action_head(outputs.last_hidden_state[:, -1].float())

                # Denormalize and execute
                action = denormalize_actions(action.cpu().numpy(), action_stats)
                obs, reward, done, info = env.step(action[0])

            if info['success']:
                success_count += 1

    success_rate = success_count / num_episodes
    print(f"Success Rate: {success_rate*100:.1f}%")
    return success_rate
def evaluate_bc_metrics(model, val_loader):
    """Evaluate behavioural cloning performance"""
    model.eval()
    total_mse = 0
    total_cosine_sim = 0
    n_samples = 0

    with torch.no_grad():
        for batch in val_loader:
            outputs = model(
                pixel_values=batch['pixel_values'],
                input_ids=batch['input_ids']
            )
            pred_actions = model.action_head(outputs.last_hidden_state[:, -1].float())

            # MSE
            mse = F.mse_loss(pred_actions, batch['actions'])
            total_mse += mse.item() * len(batch['actions'])

            # Cosine similarity
            cosine_sim = F.cosine_similarity(pred_actions, batch['actions']).mean()
            total_cosine_sim += cosine_sim.item() * len(batch['actions'])

            n_samples += len(batch['actions'])

    return {
        'mse': total_mse / n_samples,
        'cosine_similarity': total_cosine_sim / n_samples
    }
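A validation loader can be built the same way as the training loader, using the "test" split created by train_test_split as the validation set; preparing it with the same accelerator places each batch on the model's device:
val_loader = DataLoader(
    dataset['test'],
    batch_size=32,
    shuffle=False,
    collate_fn=collate_fn,
    num_workers=4
)
val_loader = accelerator.prepare(val_loader)

metrics = evaluate_bc_metrics(model, val_loader)
print(metrics)  # {'mse': ..., 'cosine_similarity': ...}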
# Save fine-tuned model
model.save_pretrained("./finetuned_vla")
processor.save_pretrained("./finetuned_vla")
# For deployment, merge the LoRA weights into the base model
from peft import PeftModel

base_model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True)
merged_model = PeftModel.from_pretrained(base_model, "./finetuned_vla")
merged_model = merged_model.merge_and_unload()
merged_model.save_pretrained("./deployed_model")

# save_pretrained does not cover the custom action head, so store it alongside the merged weights
torch.save(model.action_head.state_dict(), "./deployed_model/action_head.pt")
class VLAController:
    def __init__(self, model_path, action_dim=7):
        self.model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
        self.processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
        # Re-attach the action head saved during fine-tuning
        self.model.action_head = ActionHead(self.model.config.hidden_size, action_dim)
        self.model.action_head.load_state_dict(torch.load(f"{model_path}/action_head.pt"))
        self.model.eval()
        self.model.to('cuda')

    @torch.inference_mode()
    def predict_action(self, image, instruction):
        """Real-time action prediction"""
        # Preprocess
        inputs = self.processor(
            images=image,
            text=instruction,
            return_tensors="pt"
        ).to('cuda')

        # Predict (cast the hidden state to fp32 to match the action head)
        outputs = self.model(**inputs)
        action = self.model.action_head(outputs.last_hidden_state[:, -1].float())
        return action.cpu().numpy()[0]
# Usage
controller = VLAController("./deployed_model")
action = controller.predict_action(camera_image, "pick up the cup")
action = denormalize_actions(action, action_stats)  # map back to the robot's native action space
robot.execute_action(action)
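In practice the controller runs inside a fixed-rate control loop. The sketch below assumes a hypothetical camera.capture() interface alongside the robot.execute_action() call used above, and an assumed 10 Hz control frequency:
import time

CONTROL_HZ = 10  # assumed control frequency; tune for your robot
controller = VLAController("./deployed_model")
instruction = "pick up the cup"

while True:
    start = time.time()
    frame = camera.capture()  # hypothetical camera interface
    action = controller.predict_action(frame, instruction)
    action = denormalize_actions(action, action_stats)
    robot.execute_action(action)
    # Sleep for the remainder of the control period to hold a steady rate
    time.sleep(max(0.0, 1.0 / CONTROL_HZ - (time.time() - start)))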
RT-2 Paper and RT-2 website: This is the work that coined the term VLA (Vision-Language-Action model)
RoboMimic (for dataset handling)
Fine-tuning VLA models enables robots to perform specialized tasks under natural language control. We hope this tutorial shows how you can adapt pre-trained models to your specific hardware and tasks, leveraging large-scale pre-training while customizing for your application. Start with simulation, iterate quickly, and transition to real hardware gradually as your model improves.