Data: Supabase read-only · Retrieved: 106 · Live DeepSeek: not run · Supabase writes: not run

Radar

Usable radar list over the currently available retrieval evidence. It discloses source, freshness, uncertainty, review status, and citations before treating any item as report-ready signal.

Total retrieved items

106

Visible after filters

50

Included

101

needs_review

5

Excluded

0

Failed

0

Categories

research 50 · product update 22 · other 17 · open source 12 · model release 9 · agent 9 · opinion 8 · media interview 5

Source families

Research feeds 43 · Other public sources 21 · Open source 18 · Company/lab 17 · Analysis/media 7

Source tiers

T1 80 · T2 18 · T1.5 7 · unreviewed 1
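
As a quick consistency check, the status, source-family, and tier counts on this page can each be summed against the retrieved total. The sketch below is illustrative only: the variable and function names are hypothetical, the numbers are copied from this page, and the tier line is read as T1 = 80, T2 = 18, T1.5 = 7, unreviewed = 1.

```python
# Counts copied from this page; all names here are illustrative assumptions.
TOTAL_RETRIEVED = 106

status_counts = {"included": 101, "needs_review": 5, "excluded": 0, "failed": 0}
family_counts = {
    "Research feeds": 43,
    "Other public sources": 21,
    "Open source": 18,
    "Company/lab": 17,
    "Analysis/media": 7,
}
tier_counts = {"T1": 80, "T2": 18, "T1.5": 7, "unreviewed": 1}


def matches_total(counts, total=TOTAL_RETRIEVED):
    """Return True when the per-bucket counts add up to the retrieved total."""
    return sum(counts.values()) == total


for name, counts in [
    ("status", status_counts),
    ("source family", family_counts),
    ("tier", tier_counts),
]:
    print(name, "sums to total:", matches_total(counts))
```

The category counts are left out of this check because they add up to more than 106, which suggests (as an assumption, not something the page states) that an item can carry more than one category.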

Sources

arXiv cs.CL 12 · arXiv cs.CV 12 · arXiv cs.LG 10 · OpenAI News 9 · arXiv cs.AI 9 · Anthropic Python SDK 4 · Lex Fridman 4 · Hugging Face Transformers 3

Category tabs

Browse the visible public retrieval set by signal family.

Selected: research

Filters

Query-param filters are applied server-side and do not change the retrieval source.

Caveats

Completeness: not claimed
  • Read-only Supabase public radar retrieval was used; no Supabase write path ran.
  • 5 item(s) are marked needs_review and require human confirmation before confident synthesis.
  • This surface shows available AI Radar evidence only; it is not a claim of complete current AI industry coverage.

Evidence rows

Dense rows keep source, status, confidence, timing, and citation visible next to the claim.

Visible items: 50
01 · included · Confidence 87% · Overall 0.93 · Tier T1

An OpenAI model has disproved a central conjecture in discrete geometry

An OpenAI model solved the 80-year-old unit distance problem, disproving a central conjecture in discrete geometry, marking a milestone in AI-driven mathematics.

Why it matters: May affect model capability tracking and product benchmarking: An OpenAI model has disproved a central conjecture in discrete geometry
02 · included · Confidence 87% · Overall 0.93 · Tier T1

Newsroom

Anthropic's newsroom page, collected on May 22, 2026, features recent announcements including the launch of Claude Opus 4.7 (April 16, 2026), Claude Design (April 17, 2026), Project Glasswing (April 7, 2026), and insights from 81,000 user interviews (March 18, 2026).

Why it matters: May affect model capability tracking and product benchmarking: Newsroom
03 · included · Confidence 87% · Overall 0.91 · Tier T1

Advancing content provenance for a safer, more transparent AI ecosystem

OpenAI advances AI content provenance with Content Credentials, SynthID, and a verification tool to help people identify and trust AI-generated media.

Why it matters: May affect AI deployment risk, governance, or compliance planning: Advancing content provenance for a safer, more transparent AI ecosystem
04 · included · Confidence 22% · Overall 0.91 · Tier T1

GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation

GraphDiffMed is a knowledge-constrained medication recommendation framework using dual-scale Differential Attention v2 to filter noise and incorporate pharmacological constraints (e.g., drug-drug interactions), outperforming baselines on MIMIC-III.

Why it matters: May add technical evidence for future radar tracking: GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation
05 · included · Confidence 86% · Overall 0.91 · Tier T1

Leveraging Large Language Models for Sentiment Analysis: Multi-Modal Analysis of Decentraland's MANA Token

This study uses a BERT-based LLM for sentiment analysis of Decentraland's MANA token from Discord community, and integrates sentiment scores with multi-modal financial data (price, volume, market cap) in LSTM models for return prediction. Results show neutral sentiment with positive skew, and the multi-modal model significantly outperforms price-only baseline, demonstrating predictive value of community signals.

Why it matters: May add technical evidence for future radar tracking: Leveraging Large Language Models for Sentiment Analysis: Multi-Modal Analysis of Decentraland's MANA Token
06 · included · Confidence 87% · Overall 0.91 · Tier T1

TabPFN-MT: A Natively Multitask In-Context Learner for Tabular Data

TabPFN-MT is a natively multitask in-context learner for tabular data. It uses an expanded y-encoder and a shared decoder to enable simultaneous inference of multiple targets, reducing inference cost from O(T) to O(1). Evaluations on 344 datasets show it achieves state-of-the-art deep tabular multitask learning on small datasets (average <1000 samples), with an overall Accuracy rank of 4.89 on multitask datasets, while remaining competitive with top single-task ensembles.

Why it matters: May add technical evidence for future radar tracking: TabPFN-MT: A Natively Multitask In-Context Learner for Tabular Data
07 · included · Confidence 86% · Overall 0.91 · Tier T1

Why Latent Actions Fail, and How to Prevent It

This paper analyzes how exogenous state (e.g., background clutter) hinders latent action learning from unlabeled videos. By extending a linear latent action model to explicitly model exogenous state, the authors find that minimizing the standard reconstruction objective encodes exogenous information from future observations, and learning in a representation space focused on endogenous components is key to mitigating noise. Additionally, previously proposed auxiliary objectives like action-supervision provably encourage latent actions to be consistent across exogenous states. Experiments on linear and nonlinear models validate the findings.

Why it matters: May add technical evidence for future radar tracking: Why Latent Actions Fail, and How to Prevent It
08 · included · Confidence 86% · Overall 0.91 · Tier T1

Dimensional Balance Improves Large Scale Spatiotemporal Prediction Performance

This paper proposes a dimensional balance framework that uses spatial and temporal entropy diagnostics to harmonize feature representations via low-rank matrix embedding and extended temporal horizon, achieving substantial accuracy gains on urban traffic, meteorological, and epidemic datasets.

Why it matters: May add technical evidence for future radar tracking: Dimensional Balance Improves Large Scale Spatiotemporal Prediction Performance
09 · included · Confidence 86% · Overall 0.91 · Tier T1

Harnessing Self-Supervised Features for Art Classification

This paper systematically investigates the effectiveness of self-supervised features for artwork classification and retrieval, using DINO and CLIP models. Results show consistent improvements with self-supervised backbones, and insights into real-world applications such as VR museum navigation are provided.

Why it matters: May add technical evidence for future radar tracking: Harnessing Self-Supervised Features for Art Classification
10 · included · Confidence 86% · Overall 0.91 · Tier T1

HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models

HELLoRA is a parameter-efficient fine-tuning method for Mixture-of-Experts (MoE) models that attaches LoRA modules only to the most frequently activated experts per layer, reducing trainable parameters and adapter FLOPs while improving downstream performance. Evaluated on OlMoE, Mixtral, and DeepSeekMoE, it outperforms vanilla LoRA with significantly fewer parameters and higher accuracy and training throughput.

Why it matters: May add technical evidence for future radar tracking: HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models
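
The selection step described in the row above (attach adapters only to the most frequently activated experts in each layer) can be sketched as a per-layer top-k over routing counts. This is a minimal illustration, not the paper's implementation: the `routing_log` format, the `top_k` parameter, and the function name are assumptions made for the example.

```python
from collections import Counter


def select_hot_experts(routing_log, top_k):
    """Pick the top_k most frequently activated experts in each layer.

    routing_log: iterable of (layer_index, expert_index) activation events
                 (hypothetical format for this sketch).
    Returns a dict mapping layer_index -> list of expert indices, most
    frequently activated first.
    """
    per_layer = {}
    for layer, expert in routing_log:
        per_layer.setdefault(layer, Counter())[expert] += 1
    return {
        layer: [expert for expert, _ in counts.most_common(top_k)]
        for layer, counts in per_layer.items()
    }


# Toy routing trace: layer 0 favours expert 3, layer 1 favours expert 0.
log = [(0, 1), (0, 1), (0, 2), (0, 3), (0, 3), (0, 3), (1, 0), (1, 0), (1, 5)]
hot = select_hot_experts(log, top_k=2)
print(hot)
```

Only the experts returned here would receive LoRA modules in such a scheme; all other experts stay frozen, which is where the reduction in trainable parameters would come from.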
11 · included · Confidence 86% · Overall 0.91 · Tier T1

MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation

MotionMERGE is a unified framework that achieves fine-grained human motion editing, reasoning, and generation by explicitly modeling motion at part and temporal levels within a single LLM. It introduces ReasoningAware Granularity-Synergy pre-training and curates a large-scale dataset MotionFineEdit (837K atomic + 144K complex triplets) with fine-grained spatio-temporal corrective instructions and motion-grounded chain-of-thought annotations. Extensive experiments demonstrate superior precision in motion generation, understanding, and editing, as well as compelling zero-shot generalization.

Why it matters: May add technical evidence for future radar tracking: MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation
12 · included · Confidence 86% · Overall 0.91 · Tier T1

The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

This paper identifies the 'Annotation Scarcity Paradox' in low-resource NLP evaluation, where model scaling outpaces sovereign human infrastructure. It reviews three phases from 2014 to present and discusses responses like data augmentation and model-based evaluation, calling for a paradigm shift to community-embedded evaluation.

Why it matters: May add technical evidence for future radar tracking: The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints
13 · included · Confidence 86% · Overall 0.91 · Tier T1

How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

This paper proposes F^3A, a training-free visual token pruning router for multimodal language models, which efficiently allocates tokens under a fixed budget via task-conditioned evidence search, requiring no extra LLM forward pass.

Why it matters: May add technical evidence for future radar tracking: How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A
14 · included · Confidence 86% · Overall 0.91 · Tier T1

Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra

This paper systematically optimizes real-time diffusion model inference on Apple M3 Ultra (60-core GPU, 512GB unified memory). Across 10 phases, techniques including CoreML conversion, quantization, Token Merging, and Neural Engine utilization are evaluated. The best result (22.7 FPS at 512x512) is achieved by combining CoreML-converted distilled model SDXS-512 with a three-thread camera pipeline. Key findings show that CUDA-optimization insights (e.g., quantization speedup, parallel inference) do not transfer to Apple Silicon, revealing a distinct optimization landscape and providing practical guidelines.

Why it matters: May add technical evidence for future radar tracking: Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra
15 · included · Confidence 86% · Overall 0.91 · Tier T1

The Scaling Laws of Skills in LLM Agent Systems

This study analyzes 15 frontier LLMs, 1,141 real-world skills, and over 3 million routing/execution decisions, identifying two coupled scaling laws in LLM agent systems: the routing law (single-step routing accuracy decays logarithmically with library size) and the execution law (correct execution improves difficult downstream decisions by about 4×). A single parameter b couples the two laws. Law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%, reduces hijack from 22.4% to 4.1%, and improves pass rates on downstream benchmarks. Results show agent performance depends not only on model capability but also on skill library structure, granularity, and exposure policy.

Why it matters: May add technical evidence for future radar tracking: The Scaling Laws of Skills in LLM Agent Systems
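
To make "decays logarithmically with library size" concrete, the sketch below uses the simplest form with that property, accuracy = a - b * ln(N). The exact parameterization in the paper is not given in this summary, so the functional form, the intercept `a`, and the example values are assumptions; only the name `b` for the slope echoes the summary.

```python
import math


def routing_accuracy(library_size, a, b):
    """Illustrative logarithmic decay: a - b * ln(library_size), clamped to [0, 1].

    a: accuracy with a single skill in the library (assumed intercept).
    b: decay rate per unit of ln(library size) (assumed slope).
    """
    if library_size < 1:
        raise ValueError("library_size must be at least 1")
    return min(1.0, max(0.0, a - b * math.log(library_size)))


# Hypothetical coefficients, chosen only to show the shape of the curve.
for n in (1, 10, 100, 1000):
    print(n, round(routing_accuracy(n, a=0.95, b=0.05), 3))
```

Under this form, each doubling of the library costs the same fixed amount of accuracy (b * ln 2), regardless of how large the library already is.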
16 · included · Confidence 22% · Overall 0.91 · Tier T1

AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices

AgentStop is a lightweight efficiency supervisor for locally deployed LLM agents that predicts and terminates unlikely-to-succeed trajectories, reducing energy waste by 15-20% with minimal performance impact (<5% utility drop).

Why it matters: May add technical evidence for future radar tracking: AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices
17 · included · Confidence 87% · Overall 0.91 · Tier T1

Deep Pre-Alignment for VLMs

This paper proposes Deep Pre-Alignment (DPA), a novel architecture that replaces the standard ViT encoder with a small VLM as perceiver to deeply align visual features with the text space of the target LLM. DPA improves baselines by 1.9 points on 8 multimodal benchmarks at 4B scale and 3.0 points at 32B scale, while reducing language capability forgetting by 32.9%. Gains are consistent across Qwen3 and LLaMA 3.2 families.

Why it matters: May add technical evidence for future radar tracking: Deep Pre-Alignment for VLMs
18 · included · Confidence 86% · Overall 0.91 · Tier T1

Fluency and Faithfulness in Human and Machine Literary Translation

This study analyzes 130,486 translated paragraphs from 106 novels in 16 source languages, covering human, Google Translate, and TranslateGemma translations. It finds a consistent negative correlation between fluency and faithfulness, except for TranslateGemma, where the correlation is weaker and often non-significant. The results suggest a tradeoff between fluency and faithfulness in literary translation and indicate that segment length matters for automatic evaluation.

Why it matters: May add technical evidence for future radar tracking: Fluency and Faithfulness in Human and Machine Literary Translation
19 · included · Confidence 86% · Overall 0.91 · Tier T1

One Pass Is Not Enough: Recursive Latent Refinement for Generative Models

This paper introduces RTM, which replaces single-pass latent mapping with recursive latent refinement to improve both quality and diversity in image generation. It argues that FID is saturated and conflates fidelity with mode coverage. RTM integrated with IMLE achieves the highest precision and recall among SOTA methods on CIFAR-10, CelebA-HQ, and few-shot benchmarks, while maintaining competitive FID, and also improves StyleGAN2 variants.

Why it matters: May add technical evidence for future radar tracking: One Pass Is Not Enough: Recursive Latent Refinement for Generative Models
20 · included · Confidence 22% · Overall 0.91 · Tier T1

Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels

This study conducts a controlled empirical evaluation of three instruction-tuned models (Qwen2.5-7B, Mistral-7B, Phi-3.5-mini) at five precision levels (BF16 to 3-bit) on 12,148 BBQ bias benchmark items across 5 random seeds, totaling 911,100 inference records. Results show that 3-bit quantization causes 6-21% of previously unbiased items to develop new stereotypical behaviors, and models' willingness to select 'unknown' answers declines by 17.4%. Standard quality metrics like perplexity increase less than 0.5% at 8-bit and under 3% at 4-bit, yet 2.5-5.6% of items already develop new biases at 4-bit, demonstrating that aggregate metrics systematically miss fairness-critical degradation.

Why it matters: May affect AI deployment risk, governance, or compliance planning: Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels
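
The reported total of 911,100 inference records is consistent with a full factorial design, assuming every model is run at every precision level, on every seed, over every benchmark item. The short check below reproduces that arithmetic from the figures in the summary; the variable names are illustrative.

```python
# Figures taken from the summary above; the full-factorial reading is an assumption.
models = 3            # Qwen2.5-7B, Mistral-7B, Phi-3.5-mini
precision_levels = 5  # BF16 down to 3-bit
seeds = 5
bbq_items = 12_148

records = models * precision_levels * seeds * bbq_items
print(records)
```

The product equals the 911,100 records quoted in the row, so the summary's numbers are internally consistent under that reading.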
21 · included · Confidence 22% · Overall 0.91 · Tier T1

ReactiveGWM: Steering NPC in Reactive Game World Models

ReactiveGWM is a reactive game world model that decouples player controls from NPC behaviors using additive bias and cross-attention modules, enabling dynamic interactions and zero-shot strategy transfer. Evaluated on Street Fighter games, it maintains player controllability and achieves prompt-aligned NPC strategy adherence.

Why it matters: May add technical evidence for future radar tracking: ReactiveGWM: Steering NPC in Reactive Game World Models
22 · included · Confidence 82% · Overall 0.91 · Tier T1

SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch

This arXiv cs.AI paper introduces SDOF, a framework that models multi-agent orchestration as a constrained state machine, using an online-RLHF intent router (trained via GRPO) and a state-aware dispatcher to enforce business stage constraints. Evaluated on a recruitment system (Beisen iTalent, 6000+ enterprises), the 7B model achieves 80.9% joint accuracy on an FSM-constrained benchmark (GPT-4o: 48.9%), end-to-end task completion rate of 86.5%, and blocks all 22 injection/illegal operations. Message-level blocking achieves 100% precision and 88% recall.

Why it matters: May add technical evidence for future radar tracking: SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch
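
The idea of a state-aware dispatcher that enforces stage constraints can be illustrated with a plain lookup table: an intent is dispatched only if the current stage allows it. This is a sketch of the general pattern, not SDOF itself; the stage names, intent names, and `dispatch` function are hypothetical.

```python
# Hypothetical recruitment stages and the intents each one permits.
ALLOWED_INTENTS = {
    "screening": {"view_resume", "schedule_interview"},
    "interview": {"record_feedback", "make_offer"},
    "offer": {"send_offer", "close_position"},
}


def dispatch(stage, intent):
    """Return True if the intent may run in the given stage, False if it is blocked.

    Unknown stages allow nothing, so unrecognized input fails closed.
    """
    return intent in ALLOWED_INTENTS.get(stage, set())


print(dispatch("screening", "schedule_interview"))  # permitted in this stage
print(dispatch("screening", "send_offer"))          # blocked: wrong stage
```

Because the check is deterministic and sits after the intent router, a misrouted or injected intent is rejected by the table rather than by the model's own judgment.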
23 · included · Confidence 86% · Overall 0.89 · Tier T1

AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education

This paper proposes a three-stage framework to assess learner competency from egocentric nursing simulation videos, using frozen visual encoders (DINOv2) and few-shot learning for action recognition. On 22 sessions (3.8 hours, 493 actions), it achieves 57.4% MOF in leave-one-out 1-shot recognition. The study finds a negative correlation between recognition accuracy and competency (rho = -0.524, p=0.012 for mIoU): higher-competency students exhibit more diverse and harder-to-classify workflows but more protocol-consistent transitions. This suggests recognition accuracy as a pedagogically informative signal for automated competency assessment.

Why it matters: May add technical evidence for future radar tracking: AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education
24 · included · Confidence 22% · Overall 0.89 · Tier T1

Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification

This paper investigates the performance of quantized LLaMA-3.1 (8B) models in qualitative analysis, focusing on different quantization levels (2-8 bit) and types. To address hallucinations and instability in low-bit models, it proposes a quantization-aware multi-pass prompt verification method that reduces hallucinations through controlled steps. Experiments using 82 interview transcripts compare against a gold standard (BF16 model and human coding). Results show 8-bit models perform closest to the gold standard; 4-bit models become stable with the method; 3-bit and 2-bit models degrade but improve with the approach. The method enables low-resource LLMs to be more stable and accurate for qualitative research at lower cost.

Why it matters: May add technical evidence for future radar tracking: Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification
25 · included · Confidence 87% · Overall 0.89 · Tier T1

Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

This paper presents a microservice architecture for operationalizing Document AI, encapsulating pipelines of classification, OCR, and LLM-based structured field extraction in production. Key design decisions include hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, asynchronous IO processing, and independent horizontal scaling. Batch profiling reveals two surprising findings: OCR dominates end-to-end latency, and system saturation is determined by shared GPU-inference capacity rather than worker count. The goal is to provide practitioners with concrete architectural patterns for production-grade document understanding systems.

Why it matters: May add technical evidence for future radar tracking: Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production
26 · included · Confidence 82% · Overall 0.89 · Tier T1

Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance

This position paper advocates for developing systematic methodologies called 'data probes' (synthetic sequences generated from appropriately defined random processes) to fundamentally understand how data characteristics affect LLM performance, generalization, and robustness. The authors argue that current compute-intensive, heuristic-based approaches lack principled understanding, and propose using theoretical concepts like typical sets to analyze probe sequences, offering a pathway to foundational insights beyond empirical heuristics.

Why it matters: May add technical evidence for future radar tracking: Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance
27 · included · Confidence 86% · Overall 0.89 · Tier T1

Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration

This paper proposes COSMO-Agent, a tool-augmented reinforcement learning framework that bridges the CAD-CAE semantic gap in industrial design-simulation optimization. It casts CAD generation, CAE solving, result parsing, and geometry revision as an interactive RL environment where an LLM learns to orchestrate external tools and revise parametric geometries. A multi-constraint reward and an industry-aligned dataset covering 25 component categories are introduced. Experiments show COSMO-Agent training substantially improves small open-source LLMs, exceeding larger models in feasibility, efficiency, and stability.

Why it matters: May add technical evidence for future radar tracking: Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration
28 · included · Confidence 86% · Overall 0.89 · Tier T1

Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

Artifact-Bench is a comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) on detecting and analyzing artifacts in AI-generated videos. It establishes a three-level hierarchical taxonomy of realism artifacts covering photorealistic, animated, and CG-style videos, and defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models approaching random or below-random performance in challenging settings, and significant misalignment between MLLM judgments and human perceptual preferences.

Why it matters: May add technical evidence for future radar tracking: Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos
29 · included · Confidence 86% · Overall 0.89 · Tier T1

Robust Basis Spline Decoupling for the Compression of Transformer Models

This paper introduces a B-spline-based decoupling framework for compressing transformer models. It proposes a robust alternating least-squares algorithm (R-CMTF-BSD) using constrained coupled matrix-tensor factorization, achieving substantial parameter reduction while maintaining competitive accuracy on Vision and Swin Transformer architectures.

Why it matters: May add technical evidence for future radar tracking: Robust Basis Spline Decoupling for the Compression of Transformer Models
30 · included · Confidence 82% · Overall 0.89 · Tier T1

Noise2Params: Unification and Parameter Determination from Noise via a Probabilistic Event Camera Model

This paper develops a probabilistic model for event cameras based on photon statistics, unifying static scene noise events and step response curves. It proposes Noise2Params, a method to determine camera-specific parameters (B, α, θ) by minimizing error against observed noise distributions, requiring only recordings of static uniform scenes. Experiments show that CNNs trained on synthetic noise data from the model outperform those trained solely on experimental data in static scene reconstruction.

Why it matters: May add technical evidence for future radar tracking: Noise2Params: Unification and Parameter Determination from Noise via a Probabilistic Event Camera Model
31 · included · Confidence 86% · Overall 0.89 · Tier T1

StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

This paper proposes StrLoRA, a framework for Multimodal Large Language Models in Streaming Continual Visual Instruction Tuning (Streaming CVIT). Streaming CVIT is a new, more realistic setting where data arrives as continuous chunks of dynamically mixed tasks. StrLoRA uses a regularized two-stage expert routing: task-aware expert selection via textual instruction, token-wise expert weighting via cross-modal attention, and routing-stability regularization. Experiments on a new StrCVIT benchmark show StrLoRA substantially outperforms existing methods.

Why it matters: May change available building blocks for teams evaluating open implementations: StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs
32 · included · Confidence 84% · Overall 0.89 · Tier T1

Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations

This study examines whether improvements in Theory of Mind (ToM) for LLMs truly benefit dynamic human-AI interactions. By proposing an interactive evaluation paradigm and systematically studying four ToM enhancement techniques, it finds that gains on static benchmarks do not necessarily translate to better performance in dynamic interactions, highlighting the need for interaction-based assessments.

Why it matters: May add technical evidence for future radar tracking: Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations
33 · included · Confidence 82% · Overall 0.89 · Tier T1

TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination

This paper identifies a compounding occupancy shift failure in sequential fine-tuning of multi-agent LLMs and proposes TeamTR, a trust-region framework that resamples trajectories and enforces per-agent divergence control, achieving 7.1% average improvement over baselines.

Why it matters: May change available building blocks for teams evaluating open implementations: TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination
34 · included · Confidence 80% · Overall 0.88 · Tier T1.5

Yarin Gal - Home Page | Oxford Machine Learning

Home page of Yarin Gal, a researcher at Oxford Machine Learning. The page serves as a portal with links to his research, publications, talks, software, blog, and other resources.

Why it matters: May add technical evidence for future radar tracking: Yarin Gal - Home Page | Oxford Machine Learning
35 · included · Confidence 86% · Overall 0.87 · Tier T1

Evaluating the Utility of Personal Health Records in Personalized Health AI

This paper evaluates LLMs (Gemini 3.0 Flash) for answering health queries using Personal Health Records (PHRs). 2,257 queries from three sources were matched with 1,945 de-identified PHRs. Gemini responses were generated with no PHR context, a basic summary, or full clinical notes. Evaluation used SHARP and a new framework for PHR-specific errors. Helpfulness improved significantly with PHR data (p<0.001), with potential gains in safety, accuracy, relevance, and personalization. Gaps such as temporal disorientation and rare confabulations were identified. The study supports the potential of PHR data for personalized health AI and provides a monitoring framework.

Why it matters: May add technical evidence for future radar tracking: Evaluating the Utility of Personal Health Records in Personalized Health AI
36 · included · Confidence 86% · Overall 0.87 · Tier T1

Neural Estimation of Pairwise Mutual Information in Masked Discrete Sequence Models

This paper proposes a neural framework to estimate pairwise conditional mutual information (MI) directly from the hidden states of a pretrained masked diffusion model (MDM), using ground-truth MI computed from the model's own conditional distributions for supervision. The estimator predicts the full MI matrix in a single forward pass, enabling MI-guided parallel decoding by identifying conditionally independent variable subsets. Evaluated on Sudoku and protein sequence generation with ESM-C, the method achieves a 3-5x reduction in inference-time forward passes while preserving generative quality and outperforming entropy-based parallelization methods.

Why it matters: May add technical evidence for future radar tracking: Neural Estimation of Pairwise Mutual Information in Masked Discrete Sequence Models
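
One way to picture MI-guided parallel decoding is a greedy pass over the predicted MI matrix: a position joins the current decoding step only if its estimated mutual information with every position already chosen is below a threshold. The greedy rule, the `threshold` parameter, and the function name are assumptions for illustration; the summary does not specify the paper's actual selection procedure.

```python
def independent_subset(mi, candidates, threshold):
    """Greedily select positions that look pairwise (near-)independent.

    mi:         square matrix (list of lists) of estimated pairwise MI values.
    candidates: masked positions to consider, in priority order.
    threshold:  MI below this value is treated as "independent enough".
    """
    chosen = []
    for i in candidates:
        if all(mi[i][j] < threshold for j in chosen):
            chosen.append(i)
    return chosen


# Toy matrix: positions 0 and 1 are strongly dependent, position 2 is not.
mi_matrix = [
    [0.00, 0.90, 0.01],
    [0.90, 0.00, 0.02],
    [0.01, 0.02, 0.00],
]
step = independent_subset(mi_matrix, candidates=[0, 1, 2], threshold=0.1)
print(step)
```

In this toy case positions 0 and 2 are decoded together and position 1 waits for a later step, which is how fewer forward passes could be spent without sampling strongly coupled variables independently.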
37 · included · Confidence 86% · Overall 0.87 · Tier T1

OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

This paper introduces OSCToM, an RL-guided approach for generating high-order Theory of Mind conflicts to improve LLMs' recursive reasoning in complex social settings. It achieves 76% accuracy on FANToM and is 6x more efficient in data synthesis.

Why it matters: May add technical evidence for future radar tracking: OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind
38 · included · Confidence 86% · Overall 0.87 · Tier T1

Shiny Stories, Hidden Struggles: Investigating the Representation of Disability Through the Lens of LLMs

This paper investigates how LLMs represent disability by simulating social media posts from the perspective of individuals with disabilities, comparing them with posts by real disabled people. It finds that LLMs tend to idealize disability experiences with overly positive stereotypes, and exhibit negative bias by disproportionately associating topics like career and entertainment with non-disabled individuals.

Why it matters: May add technical evidence for future radar tracking: Shiny Stories, Hidden Struggles: Investigating the Representation of Disability Through the Lens of LLMs
39 · included · Confidence 86% · Overall 0.87 · Tier T1

SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

This paper proposes SOLAR, a self-optimizing lifelong autonomous reasoner that leverages parameter-level meta-learning and multi-level reinforcement learning for continual adaptation without gradient updates, outperforming strong baselines on commonsense, math, medical, coding, social, and logical reasoning tasks.

Why it matters: May add technical evidence for future radar tracking: SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation
40 · included · Confidence 87% · Overall 0.87 · Tier T1

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

This paper presents a benchmark evaluating five commercial ASR systems on code-switching speech across four language pairs (Egyptian Arabic-English, Saudi Arabic-English, Persian-English, German-English). Each dataset contains 300 samples selected via a two-stage pipeline. ElevenLabs Scribe v2 achieved the lowest WER (13.2% overall) and highest BERTScore (0.936 overall). The authors argue BERTScore is more reliable for Arabic and Persian due to transliteration variance. The dataset is publicly available.

Why it matters: May add technical evidence for future radar tracking: Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
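
For reference, word error rate (WER) is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below implements that standard definition; it is not the benchmark's evaluation code, and it skips the text normalization (casing, punctuation, transliteration) that a real evaluation would need.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("mat") over 5 reference words.
print(wer("the cat sat on mat", "the cat sit on"))
```

Because exact word matching penalizes every spelling variant, a metric like this is sensitive to transliteration differences, which is the concern the authors raise for Arabic and Persian.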
41 · included · Confidence 86% · Overall 0.87 · Tier T1

ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

ReacTOD is a bounded neuro-symbolic architecture for zero-shot dialogue state tracking. It reformulates NLU as discrete tool calls within a self-correcting ReAct loop with deterministic validation. On MultiWOZ 2.1, it achieves 52.71% joint goal accuracy with gpt-oss-20B (14 points improvement) and 47.34% with Qwen3-8B. On SGD, Claude-Opus-4.6 achieves 80.68% JGA. The architecture improves accuracy by up to 9.3% over single-pass inference and achieves 93.1% self-correction rate on intercepted errors.

Why it matters: May add technical evidence for future radar tracking: ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking
42 · included · Confidence 86% · Overall 0.87 · Tier T1

PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures

The paper introduces PQR, a framework for automatically generating diverse and realistic user queries that elicit failures (e.g., unhelpfulness, unsafety) in LLM-based QA agents. It operates via iterative interaction between a query refinement module and a prompt refinement module, producing failure-triggering queries that resemble real user intents. Evaluated on an e-commerce QA agent, PQR uncovers 23%-78% more unhelpful responses and generates more diverse and realistic queries than previous methods.

Why it matters: May add technical evidence for future radar tracking: PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures
43 · included · Confidence 86% · Overall 0.87 · Tier T1

Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

This paper introduces OP-Mix, a data mixing algorithm for the entire language model training lifecycle. It cheaply simulates candidate data mixtures by interpolating low-rank adapters trained on the current model, eliminating separate proxy models. In pretraining, OP-Mix improves average perplexity by 6.3%; in continual learning, it matches retraining and on-policy distillation while using 66% and 95% less compute, respectively.

Why it matters: May add technical evidence for future radar tracking: Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time
44 · included · Confidence 86% · Overall 0.87 · Tier T1

DeepSlide: From Artifacts to Presentation Delivery

DeepSlide is a human-in-the-loop multi-agent system that supports the full presentation preparation process, from requirement elicitation and time-budgeted narrative planning to evidence-grounded slide-script generation, attention augmentation, and rehearsal support. It integrates a controllable logical-chain planner, a lightweight content-tree retriever, Markov-style sequential rendering with style inheritance, and sandboxed execution. A dual-scoreboard benchmark separates static artifact quality from dynamic delivery excellence. Across 20 domains and diverse audience profiles, DeepSlide matches strong baselines on artifact quality while achieving larger gains on delivery metrics such as narrative flow, pacing precision, slide-script synergy, and attention guidance.

Why it matters: May add technical evidence for future radar tracking: DeepSlide: From Artifacts to Presentation Delivery
45 · included · Confidence 86% · Overall 0.87 · Tier T1

DiscoExplorer: An Open Interface for the Study of Multilingual Discourse Relations

This paper presents DiscoExplorer, an open source web interface for studying multilingual discourse relations. It makes datasets from the DISRPT Shared Task publicly available, covering 16 languages, and provides query, search, and visualization facilities for relations and signaling devices such as connectives.

Why it matters: May change available building blocks for teams evaluating open implementations: DiscoExplorer: An Open Interface for the Study of Multilingual Discourse Relations
46 · included · Confidence 83% · Overall 0.83 · Tier T1.5

Lil'Log

Homepage of Lilian Weng's personal blog 'Lil'Log', described as 'Document my learning notes.' It is an English-language technical blog in the AI domain, with a source weight of 0.78.

Why it matters: May add technical evidence for future radar tracking: Lil'Log
47 · included · Confidence 80% · Overall 0.83 · Tier T2

Journal of Machine Learning Research

JMLR (Journal of Machine Learning Research) is a machine learning research journal founded in 2000, with all published papers freely available online.

Why it matters: May add technical evidence for future radar tracking: Journal of Machine Learning Research
48 · included · Confidence 79% · Overall 0.83 · Tier T2

Home - colah's blog

Homepage of Christopher Olah's blog, featuring high-quality original technical articles often reposted by Chinese AI media.

Why it matters: May add technical evidence for future radar tracking: Home - colah's blog
49 · included · Confidence 87% · Overall 0.73 · Tier T1

Leveraging Vision-Language Models to Detect Attention in Educational Videos

This paper investigates using Vision-Language Models (VLMs) to detect attention in educational videos, but finds that prompting strategies with Gemini 3 fail to outperform statistical baselines, highlighting limitations of VLMs for real-time educational diagnostics.

Why it matters: May add technical evidence for future radar tracking: Leveraging Vision-Language Models to Detect Attention in Educational Videos
50 · included · Confidence 70% · Overall 0.65 · Tier T2

Our Insights | Coatue

The Coatue Insights page serves as a central hub for Coatue, a lifecycle investment platform, featuring their latest perspectives, portfolio updates, and industry analysis. Recent content includes a public markets update deck from May 6, 2026, a partnership announcement with Anthropic, and daily charts.

Why it matters: May add technical evidence for future radar tracking: Our Insights | Coatue

Visible citations

Source: Anthropic News · Collected: May 22, 2026, 02:44 AM UTC · Status: included · Confidence: 87%
Newsroom

https://www.anthropic.com/news