Meta’s new V‑JEPA 2 model is a milestone in self-supervised video learning. Because our own Simulation-Integrated Multimodal Language (SIML) project also builds on prediction-first principles, it’s worth sketching, at a high level, where the two lines of work resonate and how they might eventually meet. This post stays squarely in public territory: we summarize the open-access JEPA paper and point to general bridges without revealing details from our next paper.
🎞️ What JEPA Brings to the Table
A latent world model, not a pixel regressor.
JEPA (Joint Embedding Predictive Architecture) learns a compressed latent representation of video and is trained to predict the latents of future and masked spatio-temporal patches, rather than the pixels themselves. The objective is to model what’s predictable, not what’s photorealistic.
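For intuition, here is a minimal sketch of a JEPA-style latent-prediction loss, assuming a context encoder, a stop-gradient target encoder, and a predictor. The function names, tensor shapes, and the L1 objective at masked positions are our reading of the setup, not Meta’s code:

```python
# Illustrative sketch of a JEPA-style latent prediction loss (not Meta's code).
# Module names and shapes are assumptions for exposition.
import torch
import torch.nn.functional as F

def jepa_latent_loss(context_encoder, target_encoder, predictor, video_patches, mask):
    """video_patches: (B, N, D_in) flattened spatio-temporal patches;
    mask: (B, N) boolean, True where a patch is hidden from the context encoder."""
    with torch.no_grad():
        # Target encoder (typically an EMA copy) embeds the full, unmasked view.
        targets = target_encoder(video_patches)            # (B, N, D)
    context = context_encoder(video_patches, mask=mask)    # sees only unmasked patches
    preds = predictor(context, mask=mask)                  # predict latents at masked positions
    # Regress predicted latents onto target latents only where patches were masked.
    return F.l1_loss(preds[mask], targets[mask])
```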
Two-stage recipe.
- Stage 1: Pre-train a ViT-g on 1M hours of internet video and 1M images, action-free.
- Stage 2: Freeze the encoder and post-train a 300M-parameter action-conditioned predictor (V‑JEPA 2‑AC) on just 62 hours of robot video.
The result? Zero-shot pick-and-place on Franka arms with no reward shaping.
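In code terms, the second stage looks roughly like the sketch below: the stage-1 encoder stays frozen, and a small action-conditioned predictor learns to map current latents plus an action to the next step’s latents. Function names, shapes, and the L1 target are our illustrative assumptions, not the paper’s exact recipe:

```python
# Rough shape of the action-conditioned post-training stage (stage 2), as we read the paper.
# Names and shapes are our own illustrative assumptions.
import torch
import torch.nn.functional as F

def post_train_step(frozen_encoder, ac_predictor, optimizer, frames_t, frames_t1, actions):
    """frames_t, frames_t1: (B, T, C, H, W) clips at steps t and t+1; actions: (B, A).
    The optimizer is assumed to hold only ac_predictor's parameters."""
    with torch.no_grad():                    # stage-1 encoder stays frozen
        z_t  = frozen_encoder(frames_t)
        z_t1 = frozen_encoder(frames_t1)     # target latents for the next step
    z_pred = ac_predictor(z_t, actions)      # predict next latents given the action
    loss = F.l1_loss(z_pred, z_t1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```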
Engineering details worth noting:
- 3D-RoPE: Rotary position embeddings over time and space stabilize training on long video clips.
- Progressive resolution: Spatial and temporal scales increase through training.
- Mask-denoise: Generalized BERT-style masking over time and space (toy sketch below).
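To make that last item concrete, here is a toy version of spatio-temporal (“tube”) masking. The mask ratio and tube geometry are illustrative assumptions, not the paper’s settings:

```python
# Toy spatio-temporal ("tube") masking in the spirit of the mask-denoise objective.
# Mask ratio and geometry are illustrative, not the paper's configuration.
import torch

def tube_mask(batch, t_patches, h_patches, w_patches, mask_ratio=0.75):
    """Return a (batch, T*H*W) boolean mask where True = hidden from the context encoder.
    The same spatial patches are masked at every time step (a 'tube' through the clip)."""
    spatial = h_patches * w_patches
    n_masked = int(mask_ratio * spatial)
    mask = torch.zeros(batch, spatial, dtype=torch.bool)
    for b in range(batch):
        idx = torch.randperm(spatial)[:n_masked]
        mask[b, idx] = True
    # Repeat the spatial mask across time so the tube spans the whole clip.
    return mask.unsqueeze(1).expand(batch, t_patches, spatial).reshape(batch, -1)
```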
🐛 SIML: Predictive Organisms in a Dirt World
Our SIML system powers embodied agents in voxel-based simulation, starting with synthetic earthworms in chemosensory soil. These agents optimize memory through “sleep” cycles that minimize free energy. States are encoded as bit-fields and nutrient vectors, and actions are chosen to reduce internal surprise, not to maximize external reward.
(Specific methods are under submission; this is a conceptual overview.)
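To give a flavor of surprise-minimizing control in general (emphatically not the SIML method, which is under submission), here is a toy action-selection loop that prefers the action whose predicted outcome the model is most certain about:

```python
# Generic illustration of surprise-minimizing action selection.
# This is NOT SIML's method; it only sketches the idea of choosing actions that
# reduce internal prediction error rather than chasing an external reward.
import torch

def least_surprising_action(world_model, state, candidate_actions):
    """world_model(state, action) -> (predicted_next_state, predicted_uncertainty).
    Returns the candidate action with the lowest summed predicted uncertainty,
    used here as a crude proxy for expected surprise / free energy."""
    best_action, best_surprise = None, float("inf")
    for action in candidate_actions:
        _, uncertainty = world_model(state, action)
        surprise = uncertainty.sum().item()
        if surprise < best_surprise:
            best_action, best_surprise = action, surprise
    return best_action
```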
🚧 Current Challenges
- Modality bridging: SIML includes non-visual sensory data (like pH and salinity). We’re exploring hybrid encoders that join ViT features with low-dimensional vector streams; a rough sketch follows this list.
- Compute tradeoffs: Rendering millions of simulation frames at high fps is costly. Latent-space pretraining or image-compression pipelines may help.
- Evaluation mismatch: JEPA benchmarks focus on robot arms. SIML agents dig, digest, and adapt. Cross-domain metrics that reward general predictive structure are needed.
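Here is a minimal sketch of what such a hybrid encoder could look like, assuming frozen ViT video tokens plus a small vector of chemosensory readings. The class name, dimensions, and fusion-by-concatenation choice are illustrative assumptions, not a settled design:

```python
# One possible hybrid encoder (an illustrative sketch, not a settled design):
# concatenate ViT video tokens with a projection of low-dimensional chemosensory
# readings (e.g. pH, salinity) and fuse them with a small transformer.
import torch
import torch.nn as nn

class HybridEncoder(nn.Module):
    def __init__(self, vit_dim=1024, sensor_dim=8, d_model=512, n_layers=4):
        super().__init__()
        self.vit_proj = nn.Linear(vit_dim, d_model)         # project video tokens
        self.sensor_proj = nn.Linear(sensor_dim, d_model)   # lift pH/salinity/etc. to token width
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, vit_tokens, sensor_vec):
        """vit_tokens: (B, N, vit_dim); sensor_vec: (B, sensor_dim)."""
        tokens = self.vit_proj(vit_tokens)                       # (B, N, d_model)
        sensors = self.sensor_proj(sensor_vec).unsqueeze(1)      # (B, 1, d_model)
        return self.fusion(torch.cat([tokens, sensors], dim=1))  # (B, N+1, d_model)
```

Appending the sensor readings as one extra token is the simplest fusion we can think of; cross-attention or feature-wise conditioning are equally plausible alternatives.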
🔁 A Shared Future of Predictive Learning
The bigger picture: prediction-first self-supervision is fusing with embodied learning. Meta’s JEPA shows how general visual priors can enable real-world control. SIML explores how structured internal states evolve into digital cognition. The intersection (shared latent spaces, predictive codes, and surprise-minimizing agents) might just be the next step.
We’ll have more to say once the next paper drops. For now: we’re excited about the convergence.
- Richard Everts, Founder