Meta's Fundamental AI Research (FAIR) team has introduced five new projects to advance the company's efforts towards building advanced machine intelligence (AMI).
The latest releases from Meta focus heavily on enhancing AI perception – the ability for machines to process and interpret sensory information – alongside advancements in language modelling, robotics, and collaborative AI agents.
Meta stated its goal involves creating machines “that are able to acquire, process, and interpret sensory information about the world around us and are able to use this information to make decisions with human-like intelligence and speed.”
The five new releases represent diverse but interconnected efforts towards achieving this ambitious goal.
Perception Encoder: Meta Enhances AI’s Visual Intelligence
Central to the new releases is the Perception Encoder, described as a large-scale vision encoder designed to excel across various image and video tasks.
Vision encoders function as the “eyes” for AI systems, allowing them to understand visual data.
Meta emphasises the growing difficulty of building encoders that meet the demands of modern AI systems: bridging vision and language, handling both images and videos, and remaining robust under tough operational conditions, including potential adversarial attacks.
The ideal encoder, according to Meta, should recognise a wide array of concepts while distinguishing subtle details—citing examples like spotting “a stingray burrowed under the sea floor, identifying a tiny goldfinch in the background of an image, or catching a scampering agouti on a night vision wildlife camera.”
Meta claims the Perception Encoder achieves “exceptional performance on image and video zero-shot classification and retrieval, surpassing all existing open source and proprietary models for such tasks.”
Furthermore, its perceptual strengths reportedly translate well to language tasks.
When aligned with a large language model, the encoder reportedly outperforms other vision encoders on tasks such as visual question answering, captioning, document understanding, and grounding. It also reportedly boosts performance on tasks traditionally difficult for LLMs, such as understanding spatial relationships (e.g., whether one object is behind another) or camera movement relative to an object.
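To make the zero-shot claim concrete, the sketch below shows how classification with a dual image/text encoder of this kind typically works: each class name is turned into a text prompt, embedded, and compared with the image embedding by cosine similarity. The encode_text callable and the prompt template are illustrative placeholders, not the Perception Encoder's actual API.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb: torch.Tensor, class_names: list[str], encode_text) -> str:
    """Return the class whose text embedding best matches the image embedding."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embs = torch.stack([encode_text(p) for p in prompts])  # (C, D)
    image_emb = F.normalize(image_emb, dim=-1)                  # (D,)
    text_embs = F.normalize(text_embs, dim=-1)                  # (C, D)
    scores = text_embs @ image_emb                              # cosine similarity per class
    return class_names[int(scores.argmax())]
```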
“As Perception Encoder begins to be integrated into new applications, we’re excited to see how its advanced vision capabilities will enable even more capable AI systems,” Meta said.
PLM: Open Research at the Intersection of Vision and Language
Complementing the encoder is the Perception Language Model (PLM), an open and reproducible vision-language model designed to handle complex visual recognition tasks.
PLM was trained on synthetic data combined with open vision-language datasets, explicitly without distilling knowledge from proprietary models.
Recognising gaps in existing video understanding data, the FAIR team collected 2.5 million new, human-labelled samples focused on fine-grained video question answering and spatio-temporal captioning. Meta claims this forms the “largest dataset of its kind to date.”
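For a sense of what fine-grained, spatio-temporally grounded annotation involves, a single record might look something like the sketch below; the field names and values are purely illustrative assumptions, not the schema of Meta's released dataset.

```python
# Hypothetical shape of one fine-grained video-QA annotation (illustrative only).
sample = {
    "video_id": "kitchen_0042",
    "question": "What does the person do immediately after opening the fridge?",
    "answer": "They take out a carton of milk with their left hand.",
    "time_span_s": [12.4, 15.9],                             # temporal grounding
    "region": {"x": 0.31, "y": 0.42, "w": 0.18, "h": 0.25},  # spatial grounding
}
print(sample["question"])
```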
PLM is available in versions with 1, 3, and 8 billion parameters, catering to academic researchers who require transparency.
Alongside the models, Meta is releasing PLM-VideoBench, a new benchmark specifically designed to test capabilities often missed by existing benchmarks, namely “fine-grained activity understanding and spatiotemporally grounded reasoning.”
Meta hopes the combination of open models, the large dataset, and the challenging benchmark will empower the open-source community.
Meta Locate 3D: Empowering Robotic Situational Awareness
Bridging the gap between language commands and physical action is Meta Locate 3D. This end-to-end model aims to allow robots to accurately localise objects in a 3D environment based on open-vocabulary natural language queries.
Meta Locate 3D processes 3D point clouds directly from RGB-D sensors (like those found on some robots or depth-sensing cameras). Given a textual prompt, such as “flower vase near TV console,” the system considers spatial relationships and context to pinpoint the correct object instance, distinguishing it from, say, a “vase on the table.”
The model consists of three components: a preprocessing step that converts 2D features into a featurised 3D point cloud; the 3D-JEPA encoder, which produces a contextualised representation of the 3D world; and the Locate 3D decoder, which takes that representation together with the language query and outputs bounding boxes and masks for the target objects.
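A rough sketch of that data flow is shown below. All module names, signatures, and tensor shapes are illustrative assumptions standing in for Meta's actual implementation.

```python
from dataclasses import dataclass
import torch

@dataclass
class Localisation:
    boxes: torch.Tensor   # (N, 6) axis-aligned 3D boxes for each matched instance
    masks: torch.Tensor   # (N, P) per-point masks over the point cloud

def locate_objects(rgbd_frames, query: str, preprocess, jepa_encoder, decoder) -> Localisation:
    """Illustrative three-stage pipeline; the callables are hypothetical stand-ins."""
    # 1) Lift 2D image features from the RGB-D frames into a featurised 3D point cloud.
    point_cloud = preprocess(rgbd_frames)            # (P, 3 + F)
    # 2) Encode the point cloud into a contextualised representation of the scene.
    scene_repr = jepa_encoder(point_cloud)           # (P, D)
    # 3) Decode the scene representation together with the language query.
    boxes, masks = decoder(scene_repr, query)
    return Localisation(boxes=boxes, masks=masks)
```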
Accompanying the model release is a substantial new dataset for localising objects from referring expressions. It contains 130,000 language annotations across 1,346 scenes from the ARKitScenes, ScanNet, and ScanNet++ datasets, effectively doubling the amount of annotated data in this area.
Meta sees this technology as crucial for developing more capable robotic systems, including its own PARTNR robot project, as it enables richer human-robot interaction and collaboration.
Dynamic Transformers at the Byte Level: A Leap in Robust Language Modelling
Following research published in late 2024, Meta is now releasing the model weights for its 8-billion parameter Dynamic Byte Latent Transformer.
This architecture represents a shift away from traditional tokenisation-based language models, operating instead at the byte level. Meta claims this approach achieves comparable performance at scale while offering significant improvements in inference efficiency and robustness.
Traditional LLMs break text into ‘tokens’, which can struggle with misspellings, novel words, or adversarial inputs. Byte-level models process raw bytes, potentially offering greater resilience.
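A small example illustrates the point: a single transposed character changes a subword tokenisation substantially, whereas at the byte level it only perturbs two values in an otherwise identical sequence.

```python
# Compare the UTF-8 byte sequences of a phrase and a common misspelling of it.
text_clean = "language modelling"
text_typo = "langauge modelling"

bytes_clean = list(text_clean.encode("utf-8"))
bytes_typo = list(text_typo.encode("utf-8"))

# Same length; only the two transposed characters differ.
diff = [i for i, (a, b) in enumerate(zip(bytes_clean, bytes_typo)) if a != b]
print(len(bytes_clean), len(bytes_typo), diff)  # 18 18 [4, 5]
```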
Meta reports that the Dynamic Byte Latent Transformer “outperforms tokeniser-based models across various tasks, with an average robustness advantage of +7 points (on perturbed HellaSwag), and reaching as high as +55 points on tasks from the CUTE token-understanding benchmark.”
By releasing the weights alongside the previously shared codebase, Meta encourages the research community to explore this alternative approach to language modelling.
Advancing Social Intelligence: Meta’s New Collaborative AI Reasoners
The final release, Collaborative Reasoner, tackles the complex challenge of creating AI agents that can effectively collaborate with humans or other AIs.
Meta's focus is on collaboration with humans: it wants AI systems that can help with tasks such as homework support or job interview preparation.
Such collaboration demands not only effective problem-solving but also social skills, including communication, empathy, giving feedback, and theory-of-mind, which typically unfold across multi-turn conversations.
Current LLM training and evaluation pipelines, however, rarely account for these social and collaborative abilities, and the conversational data needed to develop them is elaborate and costly to collect.
Collaborative Reasoner provides a framework for evaluating and improving these skills. It comprises goal-oriented tasks requiring multi-step reasoning that two agents must complete by conversing with each other. The tasks test abilities such as disagreeing constructively, persuading a partner, and arriving at a mutually agreeable solution.
Meta's evaluations reveal that current models struggle to consistently leverage collaboration for better outcomes. To address this, the company proposes a self-improvement technique that uses synthetic interaction data generated by an LLM agent collaborating with itself.
A new high-performance model serving engine called Matrix makes this synthetic data generation scalable. The approach reportedly yielded improvements of 29.4% on maths, scientific, and social reasoning tasks compared with the standard 'chain-of-thought' performance of a single LLM.
By open-sourcing the data generation and modelling pipeline, Meta aims to foster further research into creating truly “social agents that can partner with humans and other agents.”
These five releases collectively underscore Meta’s continued heavy investment in fundamental AI research, particularly focusing on building blocks for machines that can perceive, understand, and interact with the world in more human-like ways.