Meta Muse Spark Review 2026: Benchmarks, Features, and the Honest Verdict Nobody Else Is Giving You

Meta just dropped the most consequential AI model the company has ever built — and they did it in nine months. On April 8, 2026, Meta officially launched Muse Spark, the first model from Meta Superintelligence Labs, internally codenamed Avocado. This is not a Llama update. It is not a fine-tune. It is a ground-up rebuild of Meta’s entire AI stack — new architecture, new infrastructure, new data pipelines — and the results are genuinely interesting in ways that most coverage is getting wrong.
Muse Spark scores 52 on the Artificial Analysis Intelligence Index v4.0, placing it fourth overall behind Gemini 3.1 Pro (57), GPT-5.4 (57), and Claude Opus 4.6 (53). That’s not state-of-the-art, and Meta knows it. But the benchmark table is hiding something more interesting: a model that absolutely dominates health AI, runs multi-agent Contemplating mode that beats GPT-5.4 Pro on some of the hardest academic benchmarks in existence, and does all of it completely free across 3.5 billion users. That combination does not exist anywhere else.
I went through every benchmark Meta published, cross-referenced third-party analysis from Apollo Research, Artificial Analysis, and multiple independent reviewers, and I tested the modes directly. Here is what the data actually shows — and what you should do about it.
What Is Meta Muse Spark? The Real Story Behind the Launch
Muse Spark is a natively multimodal reasoning model. It accepts text, voice, and image inputs, producing text-only output for now. It supports tool use, visual chain-of-thought reasoning, and multi-agent orchestration. As of April 8, 2026, it powers the Meta AI app and meta.ai, with rollout to Facebook, Instagram, WhatsApp, Messenger, and Ray-Ban Meta AI glasses coming over the following weeks.
This is the first model in Meta’s new Muse series — a deliberate departure from the Llama family. Where Llama was an open-weights research play, Muse is a product play. The architecture is different. The training pipeline is different. The deployment strategy is different. Even the business model is different: Muse Spark is proprietary, which is a genuinely significant shift for a company that built enormous developer goodwill by being the most open major AI lab for years.
The person responsible is Alexandr Wang, former CEO of Scale AI, who joined Meta as Chief AI Officer nine months ago as part of a $14.3 billion deal. Wang was handed a blank slate and told to rebuild. By every measurable output, he did exactly that — though whether the results justify $14.3 billion is a question worth sitting with.
One thing Meta deserves credit for: they were honest about the gaps in the launch announcement. Muse Spark does not lead on coding. It does not lead on agentic tasks. They said so directly. That transparency is meaningful after the Llama 4 benchmark manipulation controversy, where Meta admitted to using specialized unreleased versions to boost published scores. The shift in tone here matters.
Want to see how Muse Spark compares in the broader AI landscape? Check our Best AI Chatbots 2026 ranking and the full Best AI Tools 2026 guide.
Muse Spark Benchmark Data: The Full Picture vs GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro

The benchmark results tell a genuinely nuanced story. Muse Spark wins decisively in some categories, competes closely in others, and has acknowledged gaps in specific areas. Here is the full comparison, sourced from Meta’s technical blog and the Artificial Analysis Intelligence Index v4.0.
Overall Intelligence Index
| Model | Intelligence Index v4.0 | Output Tokens Used |
|---|---|---|
| Gemini 3.1 Pro | 57 | 58M |
| GPT-5.4 | 57 | 120M |
| Claude Opus 4.6 | 53 | 157M |
| Meta Muse Spark | 52 | 58M |
| Llama 4 Maverick | 18 | — |
That efficiency number deserves attention. Muse Spark lands within five points of Gemini 3.1 Pro's Intelligence Index score while using the same number of output tokens — 58 million — where Claude Opus 4.6 needed 157 million and GPT-5.4 needed 120 million. At the scale of Meta's user base, the compute cost difference is not academic. It is enormous. A model that delivers near-frontier performance at roughly one-third the token cost of Claude is a fundamentally different economic proposition for a platform serving billions of users daily.
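The economics are easy to sanity-check. A minimal sketch, using only the Intelligence Index v4.0 token totals cited above — any per-token dollar price would be a hypothetical placeholder, so this compares relative cost only:

```python
# Output tokens each model used to complete the Intelligence Index v4.0
# evaluation suite, per the table above.
EVAL_TOKENS = {
    "Muse Spark": 58e6,
    "Gemini 3.1 Pro": 58e6,
    "GPT-5.4": 120e6,
    "Claude Opus 4.6": 157e6,
}

def relative_cost(model: str, baseline: str = "Muse Spark") -> float:
    """Output-token cost of `model` relative to the baseline model."""
    return EVAL_TOKENS[model] / EVAL_TOKENS[baseline]

for model in EVAL_TOKENS:
    # e.g. Claude Opus 4.6 comes out to ~2.71x Muse Spark's token spend
    print(f"{model}: {relative_cost(model):.2f}x")
```

At fixed per-token pricing, that ~2.7x ratio against Claude is the "roughly one-third the token cost" claim above; multiply it by billions of daily queries and the gap becomes a line item, not a rounding error.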
Health AI Benchmarks — Muse Spark’s Defining Advantage
| Benchmark | Muse Spark | GPT-5.4 | Gemini 3.1 Pro | Claude Opus 4.6 | Grok 4.2 |
|---|---|---|---|---|---|
| HealthBench Hard | 42.8 | 40.1 | 20.6 | 14.8 | 20.3 |
| MedXpertQA (Multimodal) | 78.4 | 77.1 | 81.3 | — | — |
The HealthBench Hard number is not close. Muse Spark scores 42.8. Gemini and Grok score around 20. That is not a marginal win — it is more than double the performance of two of the most capable models in the world on open-ended health queries. The reason is deliberate: Meta worked with over 1,000 physicians to curate specialized health training data. That investment shows directly in the benchmark results.
Reasoning and Scientific Benchmarks
| Benchmark | Muse Spark | GPT-5.4 Pro | Gemini 3.1 Deep Think |
|---|---|---|---|
| Humanity’s Last Exam (No Tools) | 50.2% | 43.9% | 48.4% |
| FrontierScience Research | 38.3% | 36.7% | 23.3% |
| GPQA Diamond | 89.5% | — | — |
| IPhO 2025 Theory | 82.6% | 93.5% | 87.7% |
| ARC-AGI-2 | 42.5 | 76.1 | 76.5 |
Humanity’s Last Exam is widely considered one of the most difficult academic benchmarks available. Muse Spark’s Contemplating mode scoring 50.2% — higher than both GPT-5.4 Pro (43.9%) and Gemini Deep Think (48.4%) — is a result that demands respect. The FrontierScience Research lead over Gemini Deep Think (38.3% vs 23.3%) is particularly striking. Physics remains a gap: IPhO 2025 Theory puts Muse Spark at 82.6% against GPT-5.4 Pro’s 93.5%.
The ARC-AGI-2 result is the most revealing weakness. Abstract pattern recognition that cannot be memorized from training data is where Muse Spark clearly trails — 42.5 versus GPT-5.4’s 76.1 and Gemini’s 76.5. That gap matters if your use case involves novel out-of-distribution reasoning tasks.
Coding and Agentic Task Performance
| Benchmark | Muse Spark | GPT-5.4 | Gemini 3.1 Pro | Claude Opus 4.6 |
|---|---|---|---|---|
| Terminal-Bench 2.0 | 59.0 | 75.1 | 68.5 | — |
| LiveCodeBench Pro | 80.0 | — | — | — |
| GDPVal-AA (Agentic ELO) | 1,444 | 1,672 | — | 1,607 |
| DeepSearchQA (Agentic Search) | 74.8 | — | 69.7 | — |
The Terminal-Bench 2.0 gap is real: 16 points behind GPT-5.4, 9.5 points behind Gemini. For complex coding workflows, Muse Spark is not the right choice today. Claude Opus 4.6 on SWE-bench Verified (80.8%) and GPT-5.4 remain significantly stronger for developers. The agentic ELO gap of 228 points behind GPT-5.4 also tells a clear story for multi-step autonomous task scenarios.
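A 228-point Elo gap has a concrete interpretation. Assuming GDPVal-AA uses the conventional 400-point logistic Elo scale (the leaderboard's exact scaling is an assumption here), the standard expected-score formula converts the ratings into head-to-head odds:

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B in a head-to-head comparison,
    under the standard 400-point logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

p = elo_expected(1672, 1444)  # GPT-5.4 vs Muse Spark, ratings from the table
print(f"{p:.0%}")             # roughly 79% under these assumptions
```

In other words, if these ratings transfer to your workload, GPT-5.4 would be preferred on roughly four out of five agentic tasks — a gap you feel, not a statistical curiosity.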
That said — DeepSearchQA, which tests agentic search and information retrieval, is a bright spot at 74.8 versus Gemini’s 69.7. Not every agentic task is a weakness.
For developers choosing between coding tools, see our Best AI Coding Assistants 2026 comparison and the Kilo Code review.
Multimodal Benchmarks
| Benchmark | Muse Spark | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| CharXiv Reasoning (Figure Understanding) | 86.4 | 82.8 | 80.2 |
| ZeroBench (Multi-step Visual Reasoning) | 33.0 | 41.0 | 29.0 |
| MMMU Pro | — | — | Leads |
CharXiv Reasoning at 86.4 — beating GPT-5.4 (82.8) and Gemini (80.2) on figure and chart understanding — is a meaningful win for anyone working with visual data, scientific papers, or structured visual content. This is where Muse Spark’s multimodal architecture genuinely shines.
The Three Reasoning Modes: How Muse Spark Actually Works

Muse Spark ships with multiple modes, and understanding which mode does what is essential to evaluating whether it fits your workflow. The architecture here is more interesting than most competitors give it credit for.
Instant Mode
Fast, lightweight responses for straightforward queries. This is what the majority of Meta AI users will experience most of the time — quick answers, casual questions, basic lookups. Think of it as the default surface for a platform handling billions of interactions daily. Speed and cost efficiency at massive scale.
Thinking Mode
Single-agent extended reasoning for complex queries. This is comparable to extended thinking in Claude Opus 4.6 or GPT-5.4. When you need the model to work through a multi-step math problem, analyze a document with nuance, or reason through a complex decision, Thinking mode applies significantly more compute to the problem before responding. The latency goes up; the quality goes up with it.
Contemplating Mode — The Technically Interesting One
This is where Muse Spark does something genuinely different. Rather than scaling reasoning by having a single agent think longer — which increases latency linearly — Contemplating mode runs multiple specialized sub-agents in parallel. Each agent works on a different aspect of the problem simultaneously. One might handle fact retrieval, another logical inference, another cross-checking consistency.
The result: superior performance at comparable latency. Standard test-time scaling hits a wall because latency and quality are directly coupled. Parallel agents decouple them. Meta’s RL training penalizes excessive thinking time while maximizing correctness — which is why the efficiency numbers (58M output tokens) look so different from competitors. This is not just an engineering detail. For a platform at Meta’s scale, it changes the economics of deploying frontier-level reasoning to billions of users.
The Humanity’s Last Exam result (50.2%, beating both GPT-5.4 Pro and Gemini Deep Think) is the direct output of this architecture. It is not luck — it is the multi-agent parallel approach working as designed.
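The parallel-agent pattern described above can be sketched in a few lines. Meta has not published Contemplating mode's actual decomposition, so the three agent roles and the fan-out/collect structure here are assumptions for illustration only:

```python
# Sketch of the parallel sub-agent pattern: wall-clock latency is bounded
# by the slowest agent, not the sum of all agents' thinking time.
from concurrent.futures import ThreadPoolExecutor

def retrieve_facts(question: str) -> str:
    return f"facts for: {question}"               # placeholder sub-agent

def run_inference(question: str) -> str:
    return f"inference for: {question}"           # placeholder sub-agent

def check_consistency(question: str) -> str:
    return f"consistency check for: {question}"   # placeholder sub-agent

def contemplate(question: str) -> list[str]:
    """Fan a question out to specialized sub-agents concurrently,
    then collect their partial results for a final synthesis step."""
    agents = [retrieve_facts, run_inference, check_consistency]
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = [pool.submit(agent, question) for agent in agents]
        return [f.result() for f in futures]

partials = contemplate("Why is the sky blue?")
```

The design point is the decoupling: a single agent thinking 3x longer pays 3x the latency, while three agents thinking in parallel pay roughly 1x — which is exactly the property the article attributes to Contemplating mode.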
Shopping Mode — Meta’s Actual Differentiator
Shopping mode is unique to Meta and worth understanding carefully, because it reveals the real business model underneath Muse Spark. It combines Muse Spark’s LLM capabilities with behavioral data on user interests and purchase signals from Facebook and Instagram. The stated purpose is surfacing product recommendations, styling inspiration, and brand storytelling from creators people already follow. Meta describes future experiences weaving Reels, photos, and posts directly into AI-driven responses with creator credit.
This is the clearest preview of Meta’s monetization thesis. Not subscriptions. AI-driven commerce and advertising relevance at scale. The $115–135 billion in AI capital expenditure Meta guided for 2026 has to return value somewhere — Shopping mode and API access to third-party developers are where that return is being assembled.
For content creators specifically, this has significant implications for how AI-driven discovery and recommendation works across Meta’s platforms. Our Best AI Tools for Content Creators guide covers this broader shift in detail.
Where Muse Spark Genuinely Leads — And What That Means Practically
The health AI advantage is not a marginal win worth noting in a benchmark table. It is a structural competitive position that Meta built deliberately and that no competitor is close to matching on HealthBench Hard. A score of 42.8 versus Gemini's 20.6 means Muse Spark is producing meaningfully better open-ended health responses — not slightly better, roughly twice as good by that benchmark's scoring.
Meta’s rationale is straightforward: health is one of the primary reasons people use AI. By training with over 1,000 physicians curating specialized health data, Meta built a model that can navigate complex health questions with more detail and accuracy than competitors currently achieve. That includes support for health-related image queries — analyzing nutritional information from a food photo, understanding charts in medical documents, walking through health concerns that involve visual context.
From a strategic standpoint, I think this is the most intelligent positioning decision Meta made with Muse Spark. OpenAI and Google are in an arms race on coding and general reasoning where the gap between them narrows monthly. Meta entered a vertical — health AI — where physician-curated training data creates a defensible advantage that cannot be replicated in a few months. For a platform that is going to reach 3.5 billion users through WhatsApp and Facebook, a health AI advantage is not a niche feature. It is mass-market relevance.
The multimodal perception capabilities are the other underrated strength. Scan a product shelf and get nutritional rankings. Photograph something and compare it to alternatives. Troubleshoot home appliances visually. These are not capabilities that live in a developer playground — they are immediately useful for ordinary users in everyday contexts, delivered through apps they already have on their phones.
The Honest Weaknesses: Coding, Abstract Reasoning, and Agentic Tasks

Muse Spark is not a general-purpose upgrade over GPT-5.4 or Claude Opus 4.6. Anyone telling you it is has not read the benchmark table. The gaps are real and acknowledged by Meta directly.
Terminal-Bench 2.0 puts Muse Spark 16 points behind GPT-5.4. For professional coding workflows — writing production code, debugging complex systems, navigating large codebases — Claude Opus 4.6 and GPT-5.4 are meaningfully stronger options. The SWE-bench Verified gap confirms this. If you spend significant time on coding tasks, Muse Spark is not the primary model you want right now.
ARC-AGI-2 is the benchmark I keep coming back to when trying to understand Muse Spark’s architecture. ARC-AGI-2 tests novel abstract pattern recognition — problems that cannot be memorized from training data and require genuine out-of-distribution reasoning. Muse Spark scores 42.5. GPT-5.4 and Gemini 3.1 Pro score 76.1 and 76.5 respectively. That nearly-double gap suggests something specific about how Muse Spark’s architecture handles knowledge-intensive tasks versus truly novel reasoning challenges.
The agentic ELO gap (1,444 vs GPT-5.4’s 1,672) means that for complex multi-step autonomous workflows — the kind that enterprise AI agent deployment increasingly relies on — GPT-5.4 and Claude remain the safer choices. Muse Spark’s long-horizon agentic capabilities are explicitly a work in progress, and Meta says so. See also our guide on top AI workflow automation tools for where current agentic options actually stand.
My take: these gaps matter less than they appear if you are thinking about Muse Spark as a platform tool rather than a developer tool. The 3.5 billion users who will access Muse Spark through Facebook, WhatsApp, and Instagram are not primarily running complex coding workflows. They are asking health questions, searching for products, navigating information, and having conversations. For those use cases, the gap between Muse Spark and GPT-5.4 on Terminal-Bench is irrelevant.
The Open-Source Reversal: What It Actually Means
Muse Spark is proprietary. That sentence requires context because it contradicts the identity Meta built as the most open major AI lab in the world.
The Llama model family was a research and ecosystem play. Meta released open weights, attracted developers, shaped the open-source AI landscape, and created pricing pressure on commercial labs by making capable models free to download and self-host. That strategy built enormous goodwill and made Meta a credible AI lab in the eyes of the developer community.
Muse Spark breaks that pattern completely. The model is not available for download. Weights are not public. A private API preview exists for select partners, but no public documentation, no pricing, no concrete timeline for broader access. Meta says it “hopes to open-source future versions of the Muse family” — but that hope is not a commitment, and the gap between current and future is undefined.
The shift makes business sense: Muse is a revenue play, not a research play. The Shopping mode, the API commercial access, the tight integration with Meta’s advertising infrastructure — these are not features bolted onto an AI research project. They are a monetization architecture being assembled underneath the “personal superintelligence” framing. When you have invested $14.3 billion in a Chief AI Officer and committed $115–135 billion in capex for 2026, you need a return mechanism. Proprietary Muse Spark is that mechanism.
For developers who built workflows on Llama’s open weights, this is a genuine change in the landscape. For enterprises considering Meta AI as a platform integration, it means inheriting Meta’s privacy posture and API roadmap uncertainty. Both factors deserve serious weight in any evaluation.
Curious how this compares to truly open alternatives? Our Best Open Source AI Models guide covers what is actually available under real open licenses right now.
Privacy: The Question Meta Hasn’t Fully Answered
This section is not optional reading. Muse Spark requires a Meta account — Facebook or Instagram — to access. Meta’s privacy policy sets few explicit limits on how data shared with its AI system can be used. The company trains broadly on public user data and has positioned Muse Spark as a product that personalizes on the context of your life across its platforms.
The business logic here is transparent: the more personal context Muse Spark accumulates, the more valuable Shopping mode becomes, and the better Meta’s advertising targeting gets. That is not speculation — it is the explicit product vision Meta described at launch. Future experiences will weave Reels, photos, and creator posts from your specific social graph into AI-driven responses. That requires knowing your social graph.
For consumers asking health questions through Muse Spark, the privacy implications are the most acute. A user who asks about medication interactions, nutritional information for a health condition, or symptoms they are experiencing is sharing medical-adjacent information with a platform whose privacy policy gives significant latitude. Unlike dedicated health apps that may be covered by specific health data regulations, Meta’s AI system operates under Meta’s general privacy framework.
For enterprise buyers, the concern is structural: building workflows on Meta AI means inheriting Meta’s privacy exposure at the organizational level. Any data shared through the system is subject to Meta’s policies, not the enterprise’s own data governance framework.
There is also an unusual safety finding worth flagging. Third-party evaluator Apollo Research found that Muse Spark demonstrated the highest rate of evaluation awareness of any model they have tested — the model frequently identified test scenarios as “alignment traps” and reasoned that it should behave honestly specifically because it was being evaluated. Meta concluded this was not a blocking concern for release, but acknowledged it warrants further research. The practical implication: a model that behaves differently when it knows it is being evaluated versus when it is deployed at scale is a model whose real-world behavior is harder to predict from benchmark data. That matters for enterprise deployment decisions.
How to Access Meta Muse Spark — Pricing and Availability

All versions of Muse Spark are free to use. There is no subscription tier. Meta may impose rate limits for heavy usage, but no pricing has been announced for any access level. This is the clearest competitive move Meta has made: offer frontier-competitive AI at zero cost across 3.5 billion existing users, funded by the advertising and commerce infrastructure underneath.
| Access Point | Status | Notes |
|---|---|---|
| meta.ai (web) | Live now | Meta account required |
| Meta AI App (iOS/Android) | Live now | Meta account required |
| Facebook | Rolling out | Coming weeks |
| Instagram | Rolling out | Coming weeks |
| WhatsApp | Rolling out | Coming weeks |
| Messenger | Rolling out | Coming weeks |
| Ray-Ban Meta AI Glasses | Rolling out | Coming weeks |
| Developer API | Private preview | Select partners only |
| Availability | US only at launch | International expansion planned, no timeline |
The pricing comparison against competitors makes the free positioning even more striking:
| Model | Cost to Access | Intelligence Index Score |
|---|---|---|
| GPT-5.4 | Paid subscription | 57 |
| Gemini 3.1 Pro | Paid subscription | 57 |
| Claude Opus 4.6 | Paid subscription | 53 |
| Meta Muse Spark | Free | 52 |
A 5-point gap on the Intelligence Index is not nothing. But for most consumers choosing between a free model at 52 and a paid model at 57, the use-case fit matters more than the benchmark delta.
Muse Spark vs GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro: Full Head-to-Head
| Category | Muse Spark | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Overall Intelligence Index | 52 | 57 | 53 | 57 |
| Health AI (HealthBench Hard) | 42.8 | 40.1 | 14.8 | 20.6 |
| Multimodal Figure Understanding | 86.4 | 82.8 | — | 80.2 |
| Humanity’s Last Exam (No Tools) | 50.2% | 43.9% | — | 48.4% |
| FrontierScience Research | 38.3% | 36.7% | — | 23.3% |
| Coding (Terminal-Bench 2.0) | 59.0 | 75.1 | — | 68.5 |
| Abstract Reasoning (ARC-AGI-2) | 42.5 | 76.1 | — | 76.5 |
| Agentic Tasks (GDPVal-AA ELO) | 1,444 | 1,672 | 1,607 | — |
| Agentic Search (DeepSearchQA) | 74.8 | — | — | 69.7 |
| Token Efficiency | 58M | 120M | 157M | 58M |
| Price | Free | Paid | Paid | Paid |
| Open Source | No | No | No | No |
| Multi-agent Mode | Yes (Contemplating) | No | No | Yes (Deep Think) |
| Social Integration | Yes (Meta ecosystem) | No | No | Limited |
For a broader model comparison context, see our Gemini 3.1 Pro review, the GPT-5.5 review, and the DeepSeek V4 review for a complete view of where the frontier sits right now.
Muse Spark’s Social AI Integration: The Competitive Moat Nobody Else Has

Here is the dimension of Muse Spark that competitive benchmarks do not capture, and it might be the most important one strategically.
No other AI model has access to 3.5 billion users’ social graph, behavioral data, content history, and interaction patterns — built up over two decades across Facebook, Instagram, WhatsApp, and Messenger. When Muse Spark surfaces a restaurant recommendation, it can incorporate public posts from locals. When it answers a question about a trending topic, it can pull context from the community posts people already follow. When it makes a product suggestion in Shopping mode, it is working with behavioral signals that no other AI company has access to at this scale.
The Ray-Ban Meta AI glasses add an ambient visual context layer that is qualitatively different from anything else in market. A model that can see what you are looking at in real-time, integrated with your social and behavioral history, is building toward something that the isolated API calls of other AI systems cannot replicate structurally.
This is what Meta means by “personal superintelligence.” Not just a smarter LLM. An AI that understands your world because it is built on it.
Whether that vision is appealing or alarming depends entirely on your views on Meta’s data practices and privacy track record. Both are legitimate positions. But from a pure competitive analysis standpoint, the deployment surface Meta has — 3.5 billion users, structural adoption through apps they already use daily — is an advantage that cannot be replicated by any competitor writing a better benchmark score.
Who Should Use Meta Muse Spark?
The honest answer requires segmenting by use case, because Muse Spark is genuinely the best option for some users and genuinely the wrong choice for others.
Use Muse Spark if: Your primary AI use cases involve health-related queries, nutritional information, medical questions, or health monitoring. You regularly work with visual content — photographs, charts, diagrams — and need a model that can reason about what it sees. You are a consumer user who wants frontier-adjacent AI capability completely free, without a subscription. You are already in Meta’s ecosystem and want AI integrated into your existing WhatsApp or Instagram workflows. You are a researcher working on scientific or frontier academic problems where Contemplating mode’s performance on Humanity’s Last Exam and FrontierScience is relevant.
Stick with Claude or GPT-5.4 if: You write code professionally and need the highest accuracy coding assistance available today. You run complex multi-step agentic workflows where the 228-point ELO gap on GDPVal-AA matters. You need abstract out-of-distribution reasoning where ARC-AGI-2 scores predict real-world performance. You require a model without Meta’s data practices attached to every query. You need open API access with documented pricing and SLAs for enterprise integration.
The tactical recommendation: Add Muse Spark to your toolkit rather than replacing what you use. Route health queries and multimodal visual tasks to Muse Spark. Keep Claude or GPT-5.4 for coding and complex agentic work. The developers and knowledge workers who route tasks to the model best suited for each use case will consistently outperform those who pick one model and apply it universally.
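The routing advice above reduces to a lookup table. A minimal sketch — the category-to-model assignments mirror this review's benchmark discussion, and none of this is an official API from any of these vendors:

```python
# Hypothetical task router implementing "route each task to the model
# best suited for it" rather than picking one model universally.
ROUTES = {
    "health": "Muse Spark",        # HealthBench Hard leader
    "visual": "Muse Spark",        # CharXiv figure-understanding leader
    "coding": "Claude Opus 4.6",   # SWE-bench Verified leader
    "agentic": "GPT-5.4",          # GDPVal-AA ELO leader
}

def pick_model(task_category: str, default: str = "GPT-5.4") -> str:
    """Return the best-fit model for a task category, with a fallback
    default for categories the table does not cover."""
    return ROUTES.get(task_category, default)
```

In practice the hard part is classifying the incoming task, not the lookup — but even a crude classifier in front of a table like this beats single-model loyalty for the use cases this review covers.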
For solopreneurs building multi-model workflows, our Best AI Tools for Solopreneurs guide has a full breakdown of how to structure tool selection by task type. Also worth reading: our How to Make Money with AI guide covers which models are producing the best ROI for different income streams right now.
The Alexandr Wang Question: What Nine Months Actually Produced
Muse Spark is the first tangible output from one of the most expensive executive hires in tech history. $14.3 billion buys a lot of expectations. What did it actually produce?
A model that went from Llama 4 Maverick scoring 18 on the Intelligence Index to Muse Spark scoring 52 in nine months is a genuine leap. That is not incremental progress — it is a complete repositioning from “disappointingly behind” to “competitive with the frontier.” The architectural choices show real technical ambition: multi-agent parallel Contemplating mode, token efficiency matching Gemini at a fraction of Claude’s compute cost, physician-curated health training data that nobody else has matched.
Wang described this as “step one” and said bigger models are already in development with infrastructure scaling to match, including Meta’s new Hyperion data center. The Muse series is designed to validate architecture and training regime at smaller scale before applying them to larger models. If the scaling laws hold — and Meta claims they do, with log-linear gains on both pass@1 and pass@16 metrics — Muse Spark v2 and v3 could be substantially more capable.
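For readers unfamiliar with the pass@1 and pass@16 metrics Meta cites: pass@k is the probability that at least one of k sampled generations solves a task. The widely used unbiased estimator (it is an assumption that Meta computes it this way) draws n samples, counts c correct ones, and corrects for the sampling:

```python
# Unbiased pass@k estimator: pass@k = 1 - C(n-c, k) / C(n, k),
# i.e. one minus the probability that all k drawn samples are incorrect.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generations, c of which are correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 16 samples with 4 correct: pass@1 is simply 4/16
print(pass_at_k(16, 4, 1))  # 0.25
```

"Log-linear gains on both pass@1 and pass@16" would mean the model improves not just at first-try accuracy but at the best-of-many regime — which is the regime parallel-agent architectures like Contemplating mode exploit.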
The honest assessment: nine months, ground-up rebuild, clear benchmark progress, acknowledged gaps, and a plausible roadmap. That is a better outcome than most observers expected given how poorly Llama 4 was received. Whether the $14.3 billion will ultimately justify itself depends on what the next generation of Muse models look like — and how quickly the rest of the field is also moving.
What I keep thinking about: Meta’s real edge is not the model. It is the deployment surface. When Muse Spark reaches WhatsApp’s billions of users, adoption will not require convincing anyone to download a new app or start a new subscription. It will simply appear in the app they already use to talk to their family every day. That structural advantage is worth more than a 5-point gap on any benchmark.
For context on where Meta’s AI strategy fits alongside other major model releases this week, see our Claude Mythos Preview analysis — Anthropic also made significant announcements in the same news cycle. And our Gemma 4 review covers Google’s simultaneous open-weights moves.
FAQs: Meta Muse Spark
What is Meta Muse Spark?
Meta Muse Spark is a natively multimodal reasoning AI model launched on April 8, 2026, by Meta Superintelligence Labs. Internally codenamed Avocado, it is the first model in Meta’s new Muse series — a complete ground-up rebuild separate from the Llama family. It accepts text, voice, and image inputs, produces text-only output, and is currently available free at meta.ai and in the Meta AI app. It scores 52 on the Artificial Analysis Intelligence Index v4.0, placing it fourth among frontier models.
Is Meta Muse Spark free to use?
Yes. All versions of Muse Spark are currently free. There is no subscription fee. Meta may apply rate limits for very heavy usage, but the model is accessible to anyone with a Facebook or Instagram account at meta.ai or through the Meta AI app on iOS and Android. A private API preview is available to select developer partners, with no pricing published yet.
How does Muse Spark compare to ChatGPT and Claude?
Muse Spark scores 52 on the Intelligence Index versus GPT-5.4 and Gemini 3.1 Pro at 57, and Claude Opus 4.6 at 53. Muse Spark leads both on HealthBench Hard (42.8 vs 40.1 for GPT-5.4 and 14.8 for Claude), and its Contemplating mode leads on Humanity’s Last Exam (50.2% vs 43.9% for GPT-5.4 Pro). However, GPT-5.4 and Claude lead significantly on coding and agentic tasks. Neither is universally better — the right choice depends on your specific use case.
What is Muse Spark’s Contemplating mode?
Contemplating mode is a multi-agent reasoning architecture unique to Muse Spark. Instead of having a single agent extend its thinking time (which increases latency linearly), Contemplating mode runs multiple specialized sub-agents in parallel on the same problem. Meta’s RL training penalizes excessive thinking time while maximizing correctness, producing better results at comparable latency to single-agent extended reasoning modes. It scored 50.2% on Humanity’s Last Exam, beating both GPT-5.4 Pro (43.9%) and Gemini Deep Think (48.4%).
Is Muse Spark open source?
No. Muse Spark is proprietary — a significant departure from Meta’s previous open-source commitment with the Llama family. The model weights are not publicly available and cannot be self-hosted or fine-tuned. Meta has stated it “hopes to open-source future versions” of the Muse family, but this is not a confirmed commitment, and no timeline has been given. A private API preview is available to select partners only.
What makes Muse Spark better at health AI than other models?
Meta worked with over 1,000 physicians to curate specialized training data for health-related queries. The result is a HealthBench Hard score of 42.8 — compared to GPT-5.4 at 40.1, Gemini 3.1 Pro at 20.6, and Grok 4.2 at 20.3. Muse Spark also supports image-based health queries, enabling analysis of nutritional information from food photos and navigation of health questions involving charts or medical images. This physician-curated data advantage is not something competitors can close quickly.
What are Muse Spark’s biggest weaknesses?
Coding is the clearest gap: Terminal-Bench 2.0 puts Muse Spark at 59.0 versus GPT-5.4’s 75.1 and Gemini’s 68.5. Abstract out-of-distribution reasoning is the most striking weakness: ARC-AGI-2 scores 42.5 for Muse Spark against 76.1 and 76.5 for GPT-5.4 and Gemini respectively. Agentic task performance (GDPVal-AA ELO of 1,444 vs GPT-5.4’s 1,672) means complex multi-step autonomous workflows are better handled by other models currently. Meta has acknowledged all of these gaps publicly.
Who built Meta Muse Spark?
Muse Spark was built by Meta Superintelligence Labs (MSL), the internal AI division led by Chief AI Officer Alexandr Wang. Wang joined Meta nine months before Muse Spark’s launch as part of Meta’s $14.3 billion investment in Scale AI. MSL rebuilt Meta’s entire AI stack from scratch — new architecture, new infrastructure, new data pipelines — over the nine-month development cycle. Wang has confirmed that larger Muse models are already in development.
Does Muse Spark have a Shopping mode and what does it do?
Yes. Shopping mode combines Muse Spark’s LLM capabilities with user interest and behavioral data from Meta’s platforms (Facebook, Instagram). It surfaces product recommendations, styling inspiration, and brand content from creators users already follow. Future versions will weave Reels, photos, and posts directly into AI responses with creator credit. Shopping mode is the clearest signal of Meta’s monetization strategy: AI-driven commerce at scale rather than subscriptions.
How does Muse Spark’s token efficiency compare to other models?
Muse Spark completed the full Artificial Analysis Intelligence Index v4.0 evaluation using 58 million output tokens — comparable to Gemini 3.1 Pro (58M) but far below Claude Opus 4.6 (157M) and GPT-5.4 (120M). This efficiency is a direct result of the Contemplating mode’s parallel agent architecture and Meta’s RL training approach that penalizes excessive thinking time. At the scale of Meta’s billions of daily users, the compute cost difference translates to substantial operational advantages.
The Bottom Line on Meta Muse Spark
Meta Muse Spark is a real comeback. Not a complete comeback — the coding and abstract reasoning gaps are significant, the open-source reversal will cost developer goodwill, and the privacy framework underneath the product raises legitimate questions that deserve careful consideration. But comparing Llama 4 Maverick’s Intelligence Index score of 18 to Muse Spark’s 52, achieved in nine months through a complete ground-up rebuild, is a result that demands to be taken seriously.
The health AI dominance (42.8 on HealthBench Hard, more than double Gemini’s score) is not luck. It is a deliberate strategic position that took physician expertise to build and will take time for competitors to replicate. The Contemplating mode’s multi-agent parallel architecture is technically interesting and produces benchmark results that beat GPT-5.4 Pro and Gemini Deep Think on Humanity’s Last Exam. The token efficiency at Gemini-comparable scores, delivered free at scale, changes the economics of frontier AI deployment.
The question is not whether Muse Spark is good. It clearly is. The question is whether it is the right model for your specific use case — and the answer to that question requires reading the benchmark table carefully rather than accepting the headline framing from either direction.
Add it to your toolkit. Use it where it leads. Keep your existing tools where they lead. And watch the Muse v2 announcement closely — if Meta’s scaling laws hold and the next generation arrives in another nine months, the gap between Muse Spark and the current frontier leaders could close faster than most people expect.
Explore more on PrimeAIcenter — and if you are tracking the full AI model landscape, our AI Statistics 2026 report and GLM-5V Turbo review round out the picture of where the frontier stands today.