An OpenAI model has disproved a central conjecture in discrete geometry
An OpenAI model solved the 80-year-old unit distance problem, disproving a central conjecture in discrete geometry, marking a milestone in AI-driven mathematics.
A usable radar list built from the currently available retrieval evidence. It discloses each item's source, freshness, uncertainty, review status, and citations before any item is treated as a report-ready signal.
Total retrieved items: 106
Visible after filters: 50
Included: 101
needs_review: 5
Excluded: 0
Failed: 0
Sources
Browse the visible public retrieval set by signal family.
Query-param filters are applied server-side and do not change the retrieval source.
Dense rows keep source, status, confidence, timing, and citation visible next to the claim.
An OpenAI model solved the 80-year-old unit distance problem, disproving a central conjecture in discrete geometry, marking a milestone in AI-driven mathematics.
Anthropic's newsroom page, collected on May 22, 2026, features recent announcements including the launch of Claude Opus 4.7 (April 16, 2026), Claude Design (April 17, 2026), Project Glasswing (April 7, 2026), and insights from 81,000 user interviews (March 18, 2026).
OpenAI advances AI content provenance with Content Credentials, SynthID, and a verification tool to help people identify and trust AI-generated media.
GraphDiffMed is a knowledge-constrained medication recommendation framework using dual-scale Differential Attention v2 to filter noise and incorporate pharmacological constraints (e.g., drug-drug interactions), outperforming baselines on MIMIC-III.
This study uses a BERT-based LLM for sentiment analysis of Decentraland's MANA token from Discord community, and integrates sentiment scores with multi-modal financial data (price, volume, market cap) in LSTM models for return prediction. Results show neutral sentiment with positive skew, and the multi-modal model significantly outperforms price-only baseline, demonstrating predictive value of community signals.
TabPFN-MT is a natively multitask in-context learner for tabular data. It uses an expanded y-encoder and a shared decoder to enable simultaneous inference of multiple targets, reducing inference cost from O(T) to O(1). Evaluations on 344 datasets show it achieves state-of-the-art deep tabular multitask learning on small datasets (average <1000 samples), with an overall Accuracy rank of 4.89 on multitask datasets, while remaining competitive with top single-task ensembles.
This paper analyzes how exogenous state (e.g., background clutter) hinders latent action learning from unlabeled videos. By extending a linear latent action model to explicitly model exogenous state, the authors find that minimizing the standard reconstruction objective encodes exogenous information from future observations, and learning in a representation space focused on endogenous components is key to mitigating noise. Additionally, previously proposed auxiliary objectives like action-supervision provably encourage latent actions to be consistent across exogenous states. Experiments on linear and nonlinear models validate the findings.
This paper proposes a dimensional balance framework that uses spatial and temporal entropy diagnostics to harmonize feature representations via low-rank matrix embedding and extended temporal horizon, achieving substantial accuracy gains on urban traffic, meteorological, and epidemic datasets.
This paper systematically investigates the effectiveness of self-supervised features for artwork classification and retrieval, using DINO and CLIP models. Results show consistent improvements with self-supervised backbones, and insights into real-world applications such as VR museum navigation are provided.
HELLoRA is a parameter-efficient fine-tuning method for Mixture-of-Experts (MoE) models that attaches LoRA modules only to the most frequently activated experts per layer, reducing trainable parameters and adapter FLOPs while improving downstream performance. Evaluated on OlMoE, Mixtral, and DeepSeekMoE, it achieves higher accuracy and training throughput than vanilla LoRA with significantly fewer trainable parameters.
MotionMERGE is a unified framework that achieves fine-grained human motion editing, reasoning, and generation by explicitly modeling motion at part and temporal levels within a single LLM. It introduces ReasoningAware Granularity-Synergy pre-training and curates a large-scale dataset MotionFineEdit (837K atomic + 144K complex triplets) with fine-grained spatio-temporal corrective instructions and motion-grounded chain-of-thought annotations. Extensive experiments demonstrate superior precision in motion generation, understanding, and editing, as well as compelling zero-shot generalization.
This paper identifies the 'Annotation Scarcity Paradox' in low-resource NLP evaluation, where model scaling outpaces sovereign human infrastructure. It reviews three phases from 2014 to present and discusses responses like data augmentation and model-based evaluation, calling for a paradigm shift to community-embedded evaluation.
This paper proposes F^3A, a training-free visual token pruning router for multimodal language models, which efficiently allocates tokens under a fixed budget via task-conditioned evidence search, requiring no extra LLM forward pass.
This paper systematically optimizes real-time diffusion model inference on Apple M3 Ultra (60-core GPU, 512GB unified memory). Across 10 phases, techniques including CoreML conversion, quantization, Token Merging, and Neural Engine utilization are evaluated. The best result (22.7 FPS at 512x512) is achieved by combining CoreML-converted distilled model SDXS-512 with a three-thread camera pipeline. Key findings show that CUDA-optimization insights (e.g., quantization speedup, parallel inference) do not transfer to Apple Silicon, revealing a distinct optimization landscape and providing practical guidelines.
This study analyzes 15 frontier LLMs, 1,141 real-world skills, and over 3 million routing/execution decisions, identifying two coupled scaling laws in LLM agent systems: the routing law (single-step routing accuracy decays logarithmically with library size) and the execution law (correct execution improves difficult downstream decisions by about 4×). A single parameter b couples the two laws. Law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%, reduces hijack from 22.4% to 4.1%, and improves pass rates on downstream benchmarks. Results show agent performance depends not only on model capability but also on skill library structure, granularity, and exposure policy.
AgentStop is a lightweight efficiency supervisor for locally deployed LLM agents that predicts and terminates unlikely-to-succeed trajectories, reducing energy waste by 15-20% with minimal performance impact (<5% utility drop).
This paper proposes Deep Pre-Alignment (DPA), a novel architecture that replaces the standard ViT encoder with a small VLM as perceiver to deeply align visual features with the text space of the target LLM. DPA improves baselines by 1.9 points on 8 multimodal benchmarks at 4B scale and 3.0 points at 32B scale, while reducing language capability forgetting by 32.9%. Gains are consistent across Qwen3 and LLaMA 3.2 families.
This study analyzes 130,486 translated paragraphs from 106 novels in 16 source languages, covering human, Google Translate, and TranslateGemma translations. It finds a consistent negative correlation between fluency and faithfulness, except for TranslateGemma, where the correlation is weaker and often non-significant. The results suggest a tradeoff between fluency and faithfulness in literary translation and indicate that segment length matters for automatic evaluation.
This paper introduces RTM, which replaces single-pass latent mapping with recursive latent refinement to improve both quality and diversity in image generation. It argues that FID is saturated and conflates fidelity with mode coverage. RTM integrated with IMLE achieves the highest precision and recall among SOTA methods on CIFAR-10, CelebA-HQ, and few-shot benchmarks, while maintaining competitive FID, and also improves StyleGAN2 variants.
This study conducts a controlled empirical evaluation of three instruction-tuned models (Qwen2.5-7B, Mistral-7B, Phi-3.5-mini) at five precision levels (BF16 to 3-bit) on 12,148 BBQ bias benchmark items across 5 random seeds, totaling 911,100 inference records. Results show that 3-bit quantization causes 6-21% of previously unbiased items to develop new stereotypical behaviors, and models' willingness to select 'unknown' answers declines by 17.4%. Standard quality metrics like perplexity increase less than 0.5% at 8-bit and under 3% at 4-bit, yet 2.5-5.6% of items already develop new biases at 4-bit, demonstrating that aggregate metrics systematically miss fairness-critical degradation.
ReactiveGWM is a reactive game world model that decouples player controls from NPC behaviors using additive bias and cross-attention modules, enabling dynamic interactions and zero-shot strategy transfer. Evaluated on Street Fighter games, it maintains player controllability and achieves prompt-aligned NPC strategy adherence.
This arXiv cs.AI paper introduces SDOF, a framework that models multi-agent orchestration as a constrained state machine, using an online-RLHF intent router (trained via GRPO) and a state-aware dispatcher to enforce business stage constraints. Evaluated on a recruitment system (Beisen iTalent, 6000+ enterprises), the 7B model achieves 80.9% joint accuracy on an FSM-constrained benchmark (GPT-4o: 48.9%), end-to-end task completion rate of 86.5%, and blocks all 22 injection/illegal operations. Message-level blocking achieves 100% precision and 88% recall.
This paper proposes a three-stage framework to assess learner competency from egocentric nursing simulation videos, using frozen visual encoders (DINOv2) and few-shot learning for action recognition. On 22 sessions (3.8 hours, 493 actions), it achieves 57.4% MOF in leave-one-out 1-shot recognition. The study finds a negative correlation between recognition accuracy and competency (rho = -0.524, p=0.012 for mIoU): higher-competency students exhibit more diverse and harder-to-classify workflows but more protocol-consistent transitions. This suggests recognition accuracy as a pedagogically informative signal for automated competency assessment.
This paper investigates the performance of quantized LLaMA-3.1 (8B) models in qualitative analysis, focusing on different quantization levels (2-8 bit) and types. To address hallucinations and instability in low-bit models, it proposes a quantization-aware multi-pass prompt verification method that reduces hallucinations through controlled steps. Experiments using 82 interview transcripts compare against a gold standard (BF16 model and human coding). Results show 8-bit models perform closest to the gold standard; 4-bit models become stable with the method; 3-bit and 2-bit models degrade but improve with the approach. The method enables low-resource LLMs to be more stable and accurate for qualitative research at lower cost.
This paper presents a microservice architecture for operationalizing Document AI, encapsulating pipelines of classification, OCR, and LLM-based structured field extraction in production. Key design decisions include hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, asynchronous IO processing, and independent horizontal scaling. Batch profiling reveals two surprising findings: OCR dominates end-to-end latency, and system saturation is determined by shared GPU-inference capacity rather than worker count. The goal is to provide practitioners with concrete architectural patterns for production-grade document understanding systems.
This position paper advocates for developing systematic methodologies called 'data probes' (synthetic sequences generated from appropriately defined random processes) to fundamentally understand how data characteristics affect LLM performance, generalization, and robustness. The authors argue that current compute-intensive, heuristic-based approaches lack principled understanding, and propose using theoretical concepts like typical sets to analyze probe sequences, offering a pathway to foundational insights beyond empirical heuristics.
This paper proposes COSMO-Agent, a tool-augmented reinforcement learning framework that bridges the CAD-CAE semantic gap in industrial design-simulation optimization. It casts CAD generation, CAE solving, result parsing, and geometry revision as an interactive RL environment where an LLM learns to orchestrate external tools and revise parametric geometries. A multi-constraint reward and an industry-aligned dataset covering 25 component categories are introduced. Experiments show COSMO-Agent training substantially improves small open-source LLMs, exceeding larger models in feasibility, efficiency, and stability.
Artifact-Bench is a comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) on detecting and analyzing artifacts in AI-generated videos. It establishes a three-level hierarchical taxonomy of realism artifacts covering photorealistic, animated, and CG-style videos, and defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models approaching random or below-random performance in challenging settings, and significant misalignment between MLLM judgments and human perceptual preferences.
This paper introduces a B-spline-based decoupling framework for compressing transformer models. It proposes a robust alternating least-squares algorithm (R-CMTF-BSD) using constrained coupled matrix-tensor factorization, achieving substantial parameter reduction while maintaining competitive accuracy on Vision and Swin Transformer architectures.
This paper develops a probabilistic model for event cameras based on photon statistics, unifying static scene noise events and step response curves. It proposes Noise2Params, a method to determine camera-specific parameters (B, α, θ) by minimizing error against observed noise distributions, requiring only recordings of static uniform scenes. Experiments show that CNNs trained on synthetic noise data from the model outperform those trained solely on experimental data in static scene reconstruction.
This paper proposes StrLoRA, a framework for Multimodal Large Language Models in Streaming Continual Visual Instruction Tuning (Streaming CVIT). Streaming CVIT is a new, more realistic setting where data arrives as continuous chunks of dynamically mixed tasks. StrLoRA uses a regularized two-stage expert routing: task-aware expert selection via textual instruction, token-wise expert weighting via cross-modal attention, and routing-stability regularization. Experiments on a new StrCVIT benchmark show StrLoRA substantially outperforms existing methods.
This study examines whether improvements in Theory of Mind (ToM) for LLMs truly benefit dynamic human-AI interactions. By proposing an interactive evaluation paradigm and systematically studying four ToM enhancement techniques, it finds that gains on static benchmarks do not necessarily translate to better performance in dynamic interactions, highlighting the need for interaction-based assessments.
This paper identifies a compounding occupancy shift failure in sequential fine-tuning of multi-agent LLMs and proposes TeamTR, a trust-region framework that resamples trajectories and enforces per-agent divergence control, achieving 7.1% average improvement over baselines.
Home page of Yarin Gal, a researcher at Oxford Machine Learning. The page serves as a portal with links to his research, publications, talks, software, blog, and other resources.
This paper evaluates LLMs (Gemini 3.0 Flash) for answering health queries using Personal Health Records (PHRs). A total of 2,257 queries from three sources were matched with 1,945 de-identified PHRs. Gemini responses were generated with no PHR context, a basic summary, or full clinical notes. Evaluation used SHARP and a new framework for PHR-specific errors. Helpfulness improved significantly with PHR data (p<0.001), with potential gains in safety, accuracy, relevance, and personalization. Gaps such as temporal disorientation and rare confabulations were identified. The study supports the potential of PHR data and provides a monitoring framework.
This paper proposes a neural framework to estimate pairwise conditional mutual information (MI) directly from the hidden states of a pretrained masked diffusion model (MDM), using ground-truth MI computed from the model's own conditional distributions for supervision. The estimator predicts the full MI matrix in a single forward pass, enabling MI-guided parallel decoding by identifying conditionally independent variable subsets. Evaluated on Sudoku and protein sequence generation with ESM-C, the method achieves a 3-5x reduction in inference-time forward passes while preserving generative quality and outperforming entropy-based parallelization methods.
This paper introduces OSCToM, an RL-guided approach for generating high-order Theory of Mind conflicts to improve LLMs' recursive reasoning in complex social settings. It achieves 76% accuracy on FANToM and is 6x more efficient in data synthesis.
This paper investigates how LLMs represent disability by simulating social media posts from the perspective of individuals with disabilities, comparing them with posts by real disabled people. It finds that LLMs tend to idealize disability experiences with overly positive stereotypes, and exhibit negative bias by disproportionately associating topics like career and entertainment with non-disabled individuals.
This paper proposes SOLAR, a self-optimizing lifelong autonomous reasoner that leverages parameter-level meta-learning and multi-level reinforcement learning for continual adaptation without gradient updates, outperforming strong baselines on commonsense, math, medical, coding, social, and logical reasoning tasks.
This paper presents a benchmark evaluating five commercial ASR systems on code-switching speech across four language pairs (Egyptian Arabic-English, Saudi Arabic-English, Persian-English, German-English). Each dataset contains 300 samples selected via a two-stage pipeline. ElevenLabs Scribe v2 achieved the lowest WER (13.2% overall) and highest BERTScore (0.936 overall). The authors argue BERTScore is more reliable for Arabic and Persian due to transliteration variance. The dataset is publicly available.
ReacTOD is a bounded neuro-symbolic architecture for zero-shot dialogue state tracking. It reformulates NLU as discrete tool calls within a self-correcting ReAct loop with deterministic validation. On MultiWOZ 2.1, it achieves 52.71% joint goal accuracy with gpt-oss-20B (14 points improvement) and 47.34% with Qwen3-8B. On SGD, Claude-Opus-4.6 achieves 80.68% JGA. The architecture improves accuracy by up to 9.3% over single-pass inference and achieves 93.1% self-correction rate on intercepted errors.
The paper introduces PQR, a framework for automatically generating diverse and realistic user queries that elicit failures (e.g., unhelpfulness, unsafety) in LLM-based QA agents. It operates via iterative interaction between a query refinement module and a prompt refinement module, producing failure-triggering queries that resemble real user intents. Evaluated on an e-commerce QA agent, PQR uncovers 23%-78% more unhelpful responses and generates more diverse and realistic queries than previous methods.
This paper introduces OP-Mix, a data mixing algorithm for the entire language model training lifecycle. It cheaply simulates candidate data mixtures by interpolating low-rank adapters trained on the current model, eliminating separate proxy models. In pretraining, OP-Mix improves average perplexity by 6.3%; in continual learning, it matches retraining and on-policy distillation while using 66% and 95% less compute, respectively.
DeepSlide is a human-in-the-loop multi-agent system that supports the full presentation preparation process, from requirement elicitation and time-budgeted narrative planning to evidence-grounded slide-script generation, attention augmentation, and rehearsal support. It integrates a controllable logical-chain planner, a lightweight content-tree retriever, Markov-style sequential rendering with style inheritance, and sandboxed execution. A dual-scoreboard benchmark separates static artifact quality from dynamic delivery excellence. Across 20 domains and diverse audience profiles, DeepSlide matches strong baselines on artifact quality while achieving larger gains on delivery metrics such as narrative flow, pacing precision, slide-script synergy, and clearer attention guidance.
This paper presents DiscoExplorer, an open source web interface for studying multilingual discourse relations. It makes datasets from the DISRPT Shared Task publicly available, covering 16 languages, and provides query, search, and visualization facilities for relations and signaling devices such as connectives.
Homepage of Lilian Weng's personal blog 'Lil'Log', described as 'Document my learning notes.' It is an English-language technical blog in the AI domain (source weight 0.78).
JMLR (Journal of Machine Learning Research) is a machine learning research journal founded in 2000, with all published papers freely available online.
Homepage of Christopher Olah's blog, featuring high-quality original technical articles often reposted by Chinese AI media.
This paper investigates using Vision-Language Models (VLMs) to detect attention in educational videos, but finds that prompting strategies with Gemini 3 fail to outperform statistical baselines, highlighting limitations of VLMs for real-time educational diagnostics.
The Coatue Insights page serves as a central hub for Coatue, a lifecycle investment platform, featuring their latest perspectives, portfolio updates, and industry analysis. Recent content includes a public markets update deck from May 6, 2026, a partnership announcement with Anthropic, and daily charts.
https://openai.com/index/model-disproves-discrete-geometry-conjecture
https://www.anthropic.com/news
https://openai.com/index/advancing-content-provenance
https://arxiv.org/abs/2605.20188
https://arxiv.org/abs/2605.20192
https://arxiv.org/abs/2605.20234
https://arxiv.org/abs/2605.20223
https://arxiv.org/abs/2605.18793
https://arxiv.org/abs/2605.18974
https://arxiv.org/abs/2605.18795
https://arxiv.org/abs/2605.18956
https://arxiv.org/abs/2605.19066