Trying out an MCP server that can search arXiv papers

#Model Context Protocol (MCP)
夏目祐樹
2025.04.22
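Good evening, this is Natsume from the Manufacturing Business Technology department.
This time, I'll try out an MCP server that searches arXiv papers.

arxiv-mcp-server
https://github.com/blazickjp/arxiv-mcp-server
It is an MCP server that can search and download arXiv papers.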

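It is not an official MCP server, but it seems to be well known as an MCP server for searching arXiv papers.
The other day, when I tried paper-search-mcp, the arXiv paper search did not work well, so this is my second attempt.

Configuring the MCP server
{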
  "mcpServers": {
    "arxiv": {
      "command": "uvx",
      "args": [
        "arxiv-mcp-server",
        "--storage-path",
        "/path/to/paper/storage"
      ]
    }
  }
}
Replace /path/to/paper/storage with the full path of the directory where you want the papers to be saved.
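Since the storage path has to be a full path, it can be convenient to generate the configuration rather than type it by hand. The following is a minimal sketch, not part of arxiv-mcp-server: the helper name `build_mcp_config` and the example directory `~/arxiv-papers` are made up for illustration, while the command and arguments are copied from the configuration above.

```python
import json
from pathlib import Path


def build_mcp_config(storage_dir: str) -> dict:
    """Return an mcpServers entry for arxiv-mcp-server with an absolute storage path.

    Hypothetical helper for illustration; only the command and args come from the post.
    """
    # expanduser() handles "~"; resolve() turns a relative path into an absolute one.
    storage_path = Path(storage_dir).expanduser().resolve()
    return {
        "mcpServers": {
            "arxiv": {
                "command": "uvx",
                "args": ["arxiv-mcp-server", "--storage-path", str(storage_path)],
            }
        }
    }


if __name__ == "__main__":
    # "~/arxiv-papers" is a placeholder directory chosen for this example.
    print(json.dumps(build_mcp_config("~/arxiv-papers"), indent=2))
```

The printed JSON can then be pasted into the MCP client's configuration file.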

Trying it out

Searching for papers about the Model Context Protocol

Prompt
Please find 20 papers about the Model Context Protocol from arXiv.

Result
I'll search arXiv for papers on the Model Context Protocol.
View result from search_papers from arxiv (local)
{
  `query`: `\"Model Context Protocol\" OR \"context protocol\" machine learning`,
  `max_results`: 20
}
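(omitted)
I checked the search results, but unfortunately there appear to be no papers directly related to "Model Context Protocol". Some papers contain the words "context protocol", but they mention them in a general machine learning context and do not refer to a specific protocol.
The search results include a variety of papers on recent AI research, but no research specific to the Model Context Protocol was found. Instead, papers from the following areas were returned:
Research on improving the reasoning ability of LLMs through reinforcement learning
A survey on knowledge distillation
A survey of parameter-efficient continual fine-tuning
Self-supervised learning models for medical imaging
A method for down-sampling rollouts in LLMs
Metrology of quantum systems
Morphodynamics of synaptic spine heads
Development and benchmarking of GUI agents
Algorithms for estimating and testing collision probability
Human demonstrations for imitation learning
If you are looking for more specific information about the Model Context Protocol, changing the search keywords or providing a little more detail may allow for a more accurate search.
Once again, the search did not work well.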
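One way to check independently whether arXiv has any matching papers is to build an exact-phrase query for arXiv's public API. The sketch below only constructs the request URL and does not send it. It is not how arxiv-mcp-server builds its queries, which is not shown in this post; the function name `build_arxiv_query_url` and the sort settings are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Public arXiv API endpoint.
ARXIV_API = "http://export.arxiv.org/api/query"


def build_arxiv_query_url(phrase, categories=(), max_results=20):
    """Build an arXiv API URL that searches all fields for an exact phrase.

    Hypothetical helper for illustration only.
    """
    # Double quotes ask the arXiv API for a phrase match; "all:" searches every field.
    search_query = f'all:"{phrase}"'
    if categories:
        # Restrict to any of the given categories, e.g. cat:cs.AI OR cat:cs.LG.
        cats = " OR ".join(f"cat:{c}" for c in categories)
        search_query += f" AND ({cats})"
    params = {
        "search_query": search_query,
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_arxiv_query_url("Model Context Protocol", ["cs.AI"]))
```

Opening the printed URL in a browser returns an Atom feed, which makes it easy to see whether the phrase matches anything before blaming the MCP server or the model's choice of query.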

Restricting the search to the AI category
This MCP server lets you specify categories when searching, so I'll limit the search to the AI category.
result = await call_tool("search_papers", {
    "query": "transformer architecture",
    "max_results": 10,
    "date_from": "2023-01-01",
    "categories": ["cs.AI", "cs.LG"]
})
https://github.com/blazickjp/arxiv-mcp-server?tab=readme-ov-file#1-paper-search
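The README example above can be adapted to the query used in this post. The following is a minimal sketch with two hypothetical helpers: `build_search_args` assembles the arguments using only the keys that appear in the README example, and `filter_papers` keeps only the returned papers whose title or abstract actually contains the phrase. The `papers`, `title`, and `abstract` keys are taken from the tool output shown in this post. Whether the server treats a quoted query as an exact phrase is an assumption, which is why the client-side filter is included.

```python
def build_search_args(phrase, categories, max_results=20, date_from=None):
    """Assemble arguments for the search_papers tool (hypothetical helper)."""
    args = {
        # Quoting the phrase is an assumption about how the query is interpreted.
        "query": f'"{phrase}"',
        "max_results": max_results,
        "categories": list(categories),
    }
    if date_from is not None:
        args["date_from"] = date_from
    return args


def mentions_phrase(paper, phrase):
    """Return True if the paper's title or abstract contains the phrase."""
    text = f'{paper.get("title", "")} {paper.get("abstract", "")}'
    # Abstracts come back with embedded newlines, so normalize whitespace first.
    normalized = " ".join(text.split()).lower()
    return phrase.lower() in normalized


def filter_papers(result, phrase):
    """Keep only papers from a search_papers result that mention the phrase."""
    return [p for p in result.get("papers", []) if mentions_phrase(p, phrase)]
```

An empty list from `filter_papers` would confirm that none of the returned papers mention the phrase, regardless of how many results the tool reports.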

Prompt
Please find 20 papers about the Model Context Protocol from arXiv.

Result
I'll search for papers on the Model Context Protocol, restricted to arXiv's AI category.
View result from search_papers from arxiv (local)
{
  `query`: `\"Model Context Protocol\" OR \"context protocol\" artificial intelligence`,
  `categories`: [
    `cs.AI`
  ],
  `max_results`: 20
}
{
  "total_results": 20,
  "papers": [
    {
      "id": "2504.13837v1",
      "title": "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?",
      "authors": [
        "Yang Yue",
        "Zhiqi Chen",
        "Rui Lu",
        "Andrew Zhao",
        "Zhaokai Wang",
        "Yang Yue",
        "Shiji Song",
        "Gao Huang"
      ],
      "abstract": "Reinforcement Learning with Verifiable Rewards (RLVR) has recently\ndemonstrated notable success in enhancing the reasoning capabilities of LLMs,\nparticularly in mathematics and programming tasks. It is widely believed that\nRLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning\nabilities that exceed corresponding base models' capacity. In this study,\nhowever, we critically re-examines this assumption by measuring the\npass@\\textit{k} metric with large values of \\textit{k} to explore the reasoning\ncapability boundary of the models across a wide range of model families and\nbenchmarks. Surprisingly, the RL does \\emph{not}, in fact, elicit fundamentally\nnew reasoning patterns. While RL-trained models outperform their base models at\nsmaller values of $k$ (\\eg, $k$=1), base models can achieve a comparable or\neven higher pass@$k$ score compared to their RL counterparts at large $k$\nvalues. The reasoning paths generated by RL-trained models are already included\nin the base models' sampling distribution, suggesting that most reasoning\nabilities manifested in RL-trained models are already obtained by base models.\nFurther analysis shows that RL training boosts the performance by biasing the\nmodel's output distribution toward paths that are more likely to yield rewards,\ntherefore sampling correct responses more efficiently. But this also results in\na narrower reasoning capability boundary compared to base models. Similar\nresults are observed in visual reasoning tasks trained with RLVR. Moreover, we\nfind that distillation can genuinely introduce new knowledge into the model,\ndifferent from RLVR. These findings underscore a critical limitation of RLVR in\nadvancing LLM reasoning abilities which requires us to fundamentally rethink\nthe impact of RL training in reasoning LLMs and the need of a better paradigm.\nProject Page: https://limit-of-RLVR.github.io",
      "categories": [
        "cs.AI",
        "cs.CL",
        "cs.CV"
      ],
      "published": "2025-04-18T17:59:56+00:00",
      "url": "http://arxiv.org/pdf/2504.13837v1",
      "resource_uri": "arxiv://2504.13837v1"
    },
    {
      "id": "2504.13835v1",
      "title": "MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space",
      "authors": [
        "Yicheng Chen",
        "Yining Li",
        "Kai Hu",
        "Zerun Ma",
        "Haochen Ye",
        "Kai Chen"
      ],
      "abstract": "Data quality and diversity are key to the construction of effective\ninstruction-tuning datasets. % With the increasing availability of open-source\ninstruction-tuning datasets, it is advantageous to automatically select\nhigh-quality and diverse subsets from a vast amount of data. % Existing methods\ntypically prioritize instance quality and use heuristic rules to maintain\ndiversity. % However, this absence of a comprehensive view of the entire\ncollection often leads to suboptimal results. % Moreover, heuristic rules\ngenerally focus on distance or clustering within the embedding space, which\nfails to accurately capture the intent of complex instructions in the semantic\nspace. % To bridge this gap, we propose a unified method for quantifying the\ninformation content of datasets. This method models the semantic space by\nconstructing a label graph and quantifies diversity based on the distribution\nof information within the graph. % Based on such a measurement, we further\nintroduce an efficient sampling method that selects data samples iteratively to\n\\textbf{M}aximize the \\textbf{I}nformation \\textbf{G}ain (MIG) in semantic\nspace. % Experiments on various datasets and base models demonstrate that MIG\nconsistently outperforms state-of-the-art methods. % Notably, the model\nfine-tuned with 5\\% Tulu3 data sampled by MIG achieves comparable performance\nto the official SFT model trained on the full dataset, with improvements of\n+5.73\\% on AlpacaEval and +6.89\\% on Wildbench.",
      "categories": [
        "cs.CL",
        "cs.AI"
      ],
      "published": "2025-04-18T17:59:46+00:00",
      "url": "http://arxiv.org/pdf/2504.13835v1",
      "resource_uri": "arxiv://2504.13835v1"
    },
    {
      "id": "2504.13828v1",
      "title": "Generative AI Act II: Test Time Scaling Drives Cognition Engineering",
      "authors": [
        "Shijie Xia",
        "Yiwei Qin",
        "Xuefeng Li",
        "Yan Ma",
        "Run-Ze Fan",
        "Steffi Chern",
        "Haoyang Zou",
        "Fan Zhou",
        "Xiangkun Hu",
        "Jiahe Jin",
        "Yanheng He",
        "Yixin Ye",
        "Yixiu Liu",
        "Pengfei Liu"
      ],
      "abstract": "The first generation of Large Language Models - what might be called \"Act I\"\nof generative AI (2020-2023) - achieved remarkable success through massive\nparameter and data scaling, yet exhibited fundamental limitations in knowledge\nlatency, shallow reasoning, and constrained cognitive processes. During this\nera, prompt engineering emerged as our primary interface with AI, enabling\ndialogue-level communication through natural language. We now witness the\nemergence of \"Act II\" (2024-present), where models are transitioning from\nknowledge-retrieval systems (in latent space) to thought-construction engines\nthrough test-time scaling techniques. This new paradigm establishes a\nmind-level connection with AI through language-based thoughts. In this paper,\nwe clarify the conceptual foundations of cognition engineering and explain why\nthis moment is critical for its development. We systematically break down these\nadvanced approaches through comprehensive tutorials and optimized\nimplementations, democratizing access to cognition engineering and enabling\nevery practitioner to participate in AI's second act. We provide a regularly\nupdated collection of papers on test-time scaling in the GitHub Repository:\nhttps://github.com/GAIR-NLP/cognition-engineering",
      "categories": [
        "cs.CL",
        "cs.AI"
      ],
      "published": "2025-04-18T17:55:58+00:00",
      "url": "http://arxiv.org/pdf/2504.13828v1",
      "resource_uri": "arxiv://2504.13828v1"
    },
    {
      "id": "2504.13822v1",
      "title": "Parameter-Efficient Continual Fine-Tuning: A Survey",
      "authors": [
        "Eric Nuertey Coleman",
        "Luigi Quarantiello",
        "Ziyue Liu",
        "Qinwen Yang",
        "Samrat Mukherjee",
        "Julio Hurtado",
        "Vincenzo Lomonaco"
      ],
      "abstract": "The emergence of large pre-trained networks has revolutionized the AI field,\nunlocking new possibilities and achieving unprecedented performance. However,\nthese models inherit a fundamental limitation from traditional Machine Learning\napproaches: their strong dependence on the \\textit{i.i.d.} assumption hinders\ntheir adaptability to dynamic learning scenarios. We believe the next\nbreakthrough in AI lies in enabling efficient adaptation to evolving\nenvironments -- such as the real world -- where new data and tasks arrive\nsequentially. This challenge defines the field of Continual Learning (CL), a\nMachine Learning paradigm focused on developing lifelong learning neural\nmodels. One alternative to efficiently adapt these large-scale models is known\nParameter-Efficient Fine-Tuning (PEFT). These methods tackle the issue of\nadapting the model to a particular data or scenario by performing small and\nefficient modifications, achieving similar performance to full fine-tuning.\nHowever, these techniques still lack the ability to adjust the model to\nmultiple tasks continually, as they suffer from the issue of Catastrophic\nForgetting. In this survey, we first provide an overview of CL algorithms and\nPEFT methods before reviewing the state-of-the-art on Parameter-Efficient\nContinual Fine-Tuning (PECFT). We examine various approaches, discuss\nevaluation metrics, and explore potential future research directions. Our goal\nis to highlight the synergy between CL and Parameter-Efficient Fine-Tuning,\nguide researchers in this field, and pave the way for novel future research\ndirections.",
      "categories": [
        "cs.LG",
        "cs.AI"
      ],
      "published": "2025-04-18T17:51:51+00:00",
      "url": "http://arxiv.org/pdf/2504.13822v1",
      "resource_uri": "arxiv://2504.13822v1"
    },
    {
      "id": "2504.13818v1",
      "title": "Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning",
      "authors": [
        "Yixuan Even Xu",
        "Yash Savani",
        "Fei Fang",
        "Zico Kolter"
      ],
      "abstract": "Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing\nreasoning capabilities in large language models, but faces a fundamental\nasymmetry in computation and memory requirements: inference is embarrassingly\nparallel with a minimal memory footprint, while policy updates require\nextensive synchronization and are memory-intensive. To address this asymmetry,\nwe introduce PODS (Policy Optimization with Down-Sampling), a framework that\nstrategically decouples these phases by generating numerous rollouts in\nparallel but updating only on an informative subset. Within this framework, we\ndevelop max-variance down-sampling, a theoretically motivated method that\nselects rollouts with maximally diverse reward signals. We prove that this\napproach has an efficient algorithmic solution, and empirically demonstrate\nthat GRPO with PODS using max-variance down-sampling achieves superior\nperformance over standard GRPO on the GSM8K benchmark.",
      "categories": [
        "cs.LG",
        "cs.AI",
        "cs.CL"
      ],
      "published": "2025-04-18T17:49:55+00:00",
      "url": "http://arxiv.org/pdf/2504.13818v1",
      "resource_uri": "arxiv://2504.13818v1"
    },
    {
      "id": "2504.13804v1",
      "title": "Near-optimal algorithms for private estimation and sequential testing of collision probability",
      "authors": [
        "Robert Busa-Fekete",
        "Umar Syed"
      ],
      "abstract": "We present new algorithms for estimating and testing \\emph{collision\nprobability}, a fundamental measure of the spread of a discrete distribution\nthat is widely used in many scientific fields. We describe an algorithm that\nsatisfies $(\\alpha, \\beta)$-local differential privacy and estimates collision\nprobability with error at most $\\epsilon$ using\n$\\tilde{O}\\left(\\frac{\\log(1/\\beta)}{\\alpha^2 \\epsilon^2}\\right)$ samples for\n$\\alpha \\le 1$, which improves over previous work by a factor of\n$\\frac{1}{\\alpha^2}$. We also present a sequential testing algorithm for\ncollision probability, which can distinguish between collision probability\nvalues that are separated by $\\epsilon$ using $\\tilde{O}(\\frac{1}{\\epsilon^2})$\nsamples, even when $\\epsilon$ is unknown. Our algorithms have nearly the\noptimal sample complexity, and in experiments we show that they require\nsignificantly fewer samples than previous methods.",
      "categories": [
        "stat.ML",
        "cs.AI",
        "cs.LG"
      ],
      "published": "2025-04-18T17:12:15+00:00",
      "url": "http://arxiv.org/pdf/2504.13804v1",
      "resource_uri": "arxiv://2504.13804v1"
    },
    {
      "id": "2504.13803v1",
      "title": "Imitation Learning with Precisely Labeled Human Demonstrations",
      "authors": [
        "Yilong Song"
      ],
      "abstract": "Within the imitation learning paradigm, training generalist robots requires\nlarge-scale datasets obtainable only through diverse curation. Due to the\nrelative ease to collect, human demonstrations constitute a valuable addition\nwhen incorporated appropriately. However, existing methods utilizing human\ndemonstrations face challenges in inferring precise actions, ameliorating\nembodiment gaps, and fusing with frontier generalist robot training pipelines.\nIn this work, building on prior studies that demonstrate the viability of using\nhand-held grippers for efficient data collection, we leverage the user's\ncontrol over the gripper's appearance--specifically by assigning it a unique,\neasily segmentable color--to enable simple and reliable application of the\nRANSAC and ICP registration method for precise end-effector pose estimation. We\nshow in simulation that precisely labeled human demonstrations on their own\nallow policies to reach on average 88.1% of the performance of using robot\ndemonstrations, and boost policy performance when combined with robot\ndemonstrations, despite the inherent embodiment gap.",
      "categories": [
        "cs.RO",
        "cs.AI"
      ],
      "published": "2025-04-18T17:12:00+00:00",
      "url": "http://arxiv.org/pdf/2504.13803v1",
      "resource_uri": "arxiv://2504.13803v1"
    },
    {
      "id": "2504.13797v1",
      "title": "Meta-Learning and Knowledge Discovery based Physics-Informed Neural Network for Remaining Useful Life Prediction",
      "authors": [
        "Yu Wang",
        "Shujie Liu",
        "Shuai Lv",
        "Gengshuo Liu"
      ],
      "abstract": "Predicting the remaining useful life (RUL) of rotating machinery is critical\nfor industrial safety and maintenance, but existing methods struggle with\nscarce target-domain data and unclear degradation dynamics. We propose a\nMeta-Learning and Knowledge Discovery-based Physics-Informed Neural Network\n(MKDPINN) to address these challenges. The method first maps noisy sensor data\nto a low-dimensional hidden state space via a Hidden State Mapper (HSM). A\nPhysics-Guided Regulator (PGR) then learns unknown nonlinear PDEs governing\ndegradation evolution, embedding these physical constraints into the PINN\nframework. This integrates data-driven and physics-based approaches. The\nframework uses meta-learning, optimizing across source-domain meta-tasks to\nenable few-shot adaptation to new target tasks. Experiments on industrial data\nand the C-MAPSS benchmark show MKDPINN outperforms baselines in generalization\nand accuracy, proving its effectiveness for RUL prediction under data scarcity",
      "categories": [
        "cs.LG",
        "cs.AI"
      ],
      "published": "2025-04-18T16:58:38+00:00",
      "url": "http://arxiv.org/pdf/2504.13797v1",
      "resource_uri": "arxiv://2504.13797v1"
    },
    {
      "id": "2504.13791v1",
      "title": "Collective Learning Mechanism based Optimal Transport Generative Adversarial Network for Non-parallel Voice Conversion",
      "authors": [
        "Sandipan Dhar",
        "Md. Tousin Akhter",
        "Nanda Dulal Jana",
        "Swagatam Das"
      ],
      "abstract": "After demonstrating significant success in image synthesis, Generative\nAdversarial Network (GAN) models have likewise made significant progress in the\nfield of speech synthesis, leveraging their capacity to adapt the precise\ndistribution of target data through adversarial learning processes. Notably, in\nthe realm of State-Of-The-Art (SOTA) GAN-based Voice Conversion (VC) models,\nthere exists a substantial disparity in naturalness between real and\nGAN-generated speech samples. Furthermore, while many GAN models currently\noperate on a single generator discriminator learning approach, optimizing\ntarget data distribution is more effectively achievable through a single\ngenerator multi-discriminator learning scheme. Hence, this study introduces a\nnovel GAN model named Collective Learning Mechanism-based Optimal Transport GAN\n(CLOT-GAN) model, incorporating multiple discriminators, including the Deep\nConvolutional Neural Network (DCNN) model, Vision Transformer (ViT), and\nconformer. The objective of integrating various discriminators lies in their\nability to comprehend the formant distribution of mel-spectrograms, facilitated\nby a collective learning mechanism. Simultaneously, the inclusion of Optimal\nTransport (OT) loss aims to precisely bridge the gap between the source and\ntarget data distribution, employing the principles of OT theory. The\nexperimental validation on VCC 2018, VCTK, and CMU-Arctic datasets confirms\nthat the CLOT-GAN-VC model outperforms existing VC models in objective and\nsubjective assessments.",
      "categories": [
        "cs.SD",
        "cs.AI",
        "eess.AS"
      ],
      "published": "2025-04-18T16:44:01+00:00",
      "url": "http://arxiv.org/pdf/2504.13791v1",
      "resource_uri": "arxiv://2504.13791v1"
    },
    {
      "id": "2504.13787v1",
      "title": "Probabilistic Stability Guarantees for Feature Attributions",
      "authors": [
        "Helen Jin",
        "Anton Xue",
        "Weiqiu You",
        "Surbhi Goel",
        "Eric Wong"
      ],
      "abstract": "Stability guarantees are an emerging tool for evaluating feature\nattributions, but existing certification methods rely on smoothed classifiers\nand often yield conservative guarantees. To address these limitations, we\nintroduce soft stability and propose a simple, model-agnostic, and\nsample-efficient stability certification algorithm (SCA) that provides\nnon-trivial and interpretable guarantees for any attribution. Moreover, we show\nthat mild smoothing enables a graceful tradeoff between accuracy and stability,\nin contrast to prior certification methods that require a more aggressive\ncompromise. Using Boolean function analysis, we give a novel characterization\nof stability under smoothing. We evaluate SCA on vision and language tasks, and\ndemonstrate the effectiveness of soft stability in measuring the robustness of\nexplanation methods.",
      "categories": [
        "cs.LG",
        "cs.AI"
      ],
      "published": "2025-04-18T16:39:08+00:00",
      "url": "http://arxiv.org/pdf/2504.13787v1",
      "resource_uri": "arxiv://2504.13787v1"
    },
    {
      "id": "2504.13785v1",
      "title": "Learning Through Retrospection: Improving Trajectory Prediction for Automated Driving with Error Feedback",
      "authors": [
        "Steffen Hagedorn",
        "Aron Distelzweig",
        "Marcel Hallgarten",
        "Alexandru P. Condurache"
      ],
      "abstract": "In automated driving, predicting trajectories of surrounding vehicles\nsupports reasoning about scene dynamics and enables safe planning for the ego\nvehicle. However, existing models handle predictions as an instantaneous task\nof forecasting future trajectories based on observed information. As time\nproceeds, the next prediction is made independently of the previous one, which\nmeans that the model cannot correct its errors during inference and will repeat\nthem. To alleviate this problem and better leverage temporal data, we propose a\nnovel retrospection technique. Through training on closed-loop rollouts the\nmodel learns to use aggregated feedback. Given new observations it reflects on\nprevious predictions and analyzes its errors to improve the quality of\nsubsequent predictions. Thus, the model can learn to correct systematic errors\nduring inference. Comprehensive experiments on nuScenes and Argoverse\ndemonstrate a considerable decrease in minimum Average Displacement Error of up\nto 31.9% compared to the state-of-the-art baseline without retrospection. We\nfurther showcase the robustness of our technique by demonstrating a better\nhandling of out-of-distribution scenarios with undetected road-users.",
      "categories": [
        "cs.RO",
        "cs.AI",
        "cs.CV"
      ],
      "published": "2025-04-18T16:35:12+00:00",
      "url": "http://arxiv.org/pdf/2504.13785v1",
      "resource_uri": "arxiv://2504.13785v1"
    },
    {
      "id": "2504.13774v1",
      "title": "DP2Unlearning: An Efficient and Guaranteed Unlearning Framework for LLMs",
      "authors": [
        "Tamim Al Mahmud",
        "Najeeb Jebreel",
        "Josep Domingo-Ferrer",
        "David Sanchez"
      ],
      "abstract": "Large language models (LLMs) have recently revolutionized language processing\ntasks but have also brought ethical and legal issues. LLMs have a tendency to\nmemorize potentially private or copyrighted information present in the training\ndata, which might then be delivered to end users at inference time. When this\nhappens, a naive solution is to retrain the model from scratch after excluding\nthe undesired data. Although this guarantees that the target data have been\nforgotten, it is also prohibitively expensive for LLMs. Approximate unlearning\noffers a more efficient alternative, as it consists of ex post modifications of\nthe trained model itself to prevent undesirable results, but it lacks\nforgetting guarantees because it relies solely on empirical evidence. In this\nwork, we present DP2Unlearning, a novel LLM unlearning framework that offers\nformal forgetting guarantees at a significantly lower cost than retraining from\nscratch on the data to be retained. DP2Unlearning involves training LLMs on\ntextual data protected using {\\epsilon}-differential privacy (DP), which later\nenables efficient unlearning with the guarantees against disclosure associated\nwith the chosen {\\epsilon}. Our experiments demonstrate that DP2Unlearning\nachieves similar model performance post-unlearning, compared to an LLM\nretraining from scratch on retained data -- the gold standard exact unlearning\n-- but at approximately half the unlearning cost. In addition, with a\nreasonable computational cost, it outperforms approximate unlearning methods at\nboth preserving the utility of the model post-unlearning and effectively\nforgetting the targeted information.",
      "categories": [
        "cs.LG",
        "cs.AI"
      ],
      "published": "2025-04-18T16:22:20+00:00",
      "url": "http://arxiv.org/pdf/2504.13774v1",
      "resource_uri": "arxiv://2504.13774v1"
    },
    {
      "id": "2504.13763v1",
      "title": "Decoding Vision Transformers: the Diffusion Steering Lens",
      "authors": [
        "Ryota Takatsuki",
        "Sonia Joseph",
        "Ippei Fujisawa",
        "Ryota Kanai"
      ],
      "abstract": "Logit Lens is a widely adopted method for mechanistic interpretability of\ntransformer-based language models, enabling the analysis of how internal\nrepresentations evolve across layers by projecting them into the output\nvocabulary space. Although applying Logit Lens to Vision Transformers (ViTs) is\ntechnically straightforward, its direct use faces limitations in capturing the\nrichness of visual representations. Building on the work of Toker et al.\n(2024)~\\cite{Toker2024-ve}, who introduced Diffusion Lens to visualize\nintermediate representations in the text encoders of text-to-image diffusion\nmodels, we demonstrate that while Diffusion Lens can effectively visualize\nresidual stream representations in image encoders, it fails to capture the\ndirect contributions of individual submodules. To overcome this limitation, we\npropose \\textbf{Diffusion Steering Lens} (DSL), a novel, training-free approach\nthat steers submodule outputs and patches subsequent indirect contributions. We\nvalidate our method through interventional studies, showing that DSL provides\nan intuitive and reliable interpretation of the internal processing in ViTs.",
      "categories": [
        "cs.CV",
        "cs.AI"
      ],
      "published": "2025-04-18T16:00:53+00:00",
      "url": "http://arxiv.org/pdf/2504.13763v1",
      "resource_uri": "arxiv://2504.13763v1"
    },
    {
      "id": "2504.13756v1",
      "title": "Scaling sparse feature circuit finding for in-context learning",
      "authors": [
        "Dmitrii Kharlapenko",
        "Stepan Shabalin",
        "Fazl Barez",
        "Arthur Conmy",
        "Neel Nanda"
      ],
      "abstract": "Sparse autoencoders (SAEs) are a popular tool for interpreting large language\nmodel activations, but their utility in addressing open questions in\ninterpretability remains unclear. In this work, we demonstrate their\neffectiveness by using SAEs to deepen our understanding of the mechanism behind\nin-context learning (ICL). We identify abstract SAE features that (i) encode\nthe model's knowledge of which task to execute and (ii) whose latent vectors\ncausally induce the task zero-shot. This aligns with prior work showing that\nICL is mediated by task vectors. We further demonstrate that these task vectors\nare well approximated by a sparse sum of SAE latents, including these\ntask-execution features. To explore the ICL mechanism, we adapt the sparse\nfeature circuits methodology of Marks et al. (2024) to work for the much larger\nGemma-1 2B model, with 30 times as many parameters, and to the more complex\ntask of ICL. Through circuit finding, we discover task-detecting features with\ncorresponding SAE latents that activate earlier in the prompt, that detect when\ntasks have been performed. They are causally linked with task-execution\nfeatures through the attention and MLP sublayers.",
      "categories": [
        "cs.LG",
        "cs.AI",
        "cs.CL"
      ],
      "published": "2025-04-18T15:45:30+00:00",
      "url": "http://arxiv.org/pdf/2504.13756v1",
      "resource_uri": "arxiv://2504.13756v1"
    },
    {
      "id": "2504.13754v1",
      "title": "Towards Accurate and Interpretable Neuroblastoma Diagnosis via Contrastive Multi-scale Pathological Image Analysis",
      "authors": [
        "Zhu Zhu",
        "Shuo Jiang",
        "Jingyuan Zheng",
        "Yawen Li",
        "Yifei Chen",
        "Manli Zhao",
        "Weizhong Gu",
        "Feiwei Qin",
        "Jinhu Wang",
        "Gang Yu"
      ],
      "abstract": "Neuroblastoma, adrenal-derived, is among the most common pediatric solid\nmalignancies, characterized by significant clinical heterogeneity. Timely and\naccurate pathological diagnosis from hematoxylin and eosin-stained whole slide\nimages is critical for patient prognosis. However, current diagnostic practices\nprimarily rely on subjective manual examination by pathologists, leading to\ninconsistent accuracy. Existing automated whole slide image classification\nmethods encounter challenges such as poor interpretability, limited feature\nextraction capabilities, and high computational costs, restricting their\npractical clinical deployment. To overcome these limitations, we propose\nCMSwinKAN, a contrastive-learning-based multi-scale feature fusion model\ntailored for pathological image classification, which enhances the Swin\nTransformer architecture by integrating a Kernel Activation Network within its\nmultilayer perceptron and classification head modules, significantly improving\nboth interpretability and accuracy. By fusing multi-scale features and\nleveraging contrastive learning strategies, CMSwinKAN mimics clinicians'\ncomprehensive approach, effectively capturing global and local tissue\ncharacteristics. Additionally, we introduce a heuristic soft voting mechanism\nguided by clinical insights to seamlessly bridge patch-level predictions to\nwhole slide image-level classifications. We validate CMSwinKAN on the PpNTs\ndataset, which was collaboratively established with our partner hospital and\nthe publicly accessible BreakHis dataset. Results demonstrate that CMSwinKAN\nperforms better than existing state-of-the-art pathology-specific models\npre-trained on large datasets. Our source code is available at\nhttps://github.com/JSLiam94/CMSwinKAN.",
      "categories": [
        "cs.CV",
        "cs.AI"
      ],
      "published": "2025-04-18T15:39:46+00:00",
      "url": "http://arxiv.org/pdf/2504.13754v1",
      "resource_uri": "arxiv://2504.13754v1"
    },
    {
      "id": "2504.13751v1",
      "title": "A Survey for What Developers Require in AI-powered Tools that Aid in Component Selection in CBSD",
      "authors": [
        "Mahdi Jaberzadeh Ansari",
        "Ann Barcomb"
      ],
      "abstract": "Although it has been more than four decades that the first components-based\nsoftware development (CBSD) studies were conducted, there is still no standard\nmethod or tool for component selection which is widely accepted by the\nindustry. The gulf between industry and academia contributes to the lack of an\naccepted tool. We conducted a mixed methods survey of nearly 100 people engaged\nin component-based software engineering practice or research to better\nunderstand the problems facing industry, how these needs could be addressed,\nand current best practices employed in component selection. We also sought to\nidentify and prioritize quality criteria for component selection from an\nindustry perspective. In response to the call for CBSD component selection\ntools to incorporate recent technical advances, we also explored the\nperceptions of professionals about AI-driven tools, present and envisioned.",
      "categories": [
        "cs.SE",
        "cs.AI",
        "cs.CY"
      ],
      "published": "2025-04-18T15:35:31+00:00",
      "url": "http://arxiv.org/pdf/2504.13751v1",
      "resource_uri": "arxiv://2504.13751v1"
    },
    {
      "id": "2504.13745v1",
      "title": "ESPLoRA: Enhanced Spatial Precision with Low-Rank Adaption in Text-to-Image Diffusion Models for High-Definition Synthesis",
      "authors": [
        "Andrea Rigo",
        "Luca Stornaiuolo",
        "Mauro Martino",
        "Bruno Lepri",
        "Nicu Sebe"
      ],
      "abstract": "Diffusion models have revolutionized text-to-image (T2I) synthesis, producing\nhigh-quality, photorealistic images. However, they still struggle to properly\nrender the spatial relationships described in text prompts. To address the lack\nof spatial information in T2I generations, existing methods typically use\nexternal network conditioning and predefined layouts, resulting in higher\ncomputational costs and reduced flexibility. Our approach builds upon a curated\ndataset of spatially explicit prompts, meticulously extracted and synthesized\nfrom LAION-400M to ensure precise alignment between textual descriptions and\nspatial layouts. Alongside this dataset, we present ESPLoRA, a flexible\nfine-tuning framework based on Low-Rank Adaptation, specifically designed to\nenhance spatial consistency in generative models without increasing generation\ntime or compromising the quality of the outputs. In addition to ESPLoRA, we\npropose refined evaluation metrics grounded in geometric constraints, capturing\n3D spatial relations such as \\textit{in front of} or \\textit{behind}. These\nmetrics also expose spatial biases in T2I models which, even when not fully\nmitigated, can be strategically exploited by our TORE algorithm to further\nimprove the spatial consistency of generated images. Our method outperforms the\ncurrent state-of-the-art framework, CoMPaSS, by 13.33% on established spatial\nconsistency benchmarks.",
      "categories": [
        "cs.CV",
        "cs.AI",
        "I.4.0"
      ],
      "published": "2025-04-18T15:21:37+00:00",
      "url": "http://arxiv.org/pdf/2504.13745v1",
      "resource_uri": "arxiv://2504.13745v1"
    },
    {
      "id": "2504.13730v1",
      "title": "Controlled Territory and Conflict Tracking (CONTACT): (Geo-)Mapping Occupied Territory from Open Source Intelligence",
      "authors": [
        "Paul K. Mandal",
        "Cole Leo",
        "Connor Hurley"
      ],
      "abstract": "Open-source intelligence provides a stream of unstructured textual data that\ncan inform assessments of territorial control. We present CONTACT, a framework\nfor territorial control prediction using large language models (LLMs) and\nminimal supervision. We evaluate two approaches: SetFit, an embedding-based\nfew-shot classifier, and a prompt tuning method applied to BLOOMZ-560m, a\nmultilingual generative LLM. Our model is trained on a small hand-labeled\ndataset of news articles covering ISIS activity in Syria and Iraq, using\nprompt-conditioned extraction of control-relevant signals such as military\noperations, casualties, and location references. We show that the BLOOMZ-based\nmodel outperforms the SetFit baseline, and that prompt-based supervision\nimproves generalization in low-resource settings. CONTACT demonstrates that\nLLMs fine-tuned using few-shot methods can reduce annotation burdens and\nsupport structured inference from open-ended OSINT streams. Our code is\navailable at https://github.com/PaulKMandal/CONTACT/.",
      "categories": [
        "cs.CL",
        "cs.AI",
        "cs.LG",
        "I.2.7; I.2.6; I.2.8; H.3.1; K.4.1"
      ],
      "published": "2025-04-18T14:57:07+00:00",
      "url": "http://arxiv.org/pdf/2504.13730v1",
      "resource_uri": "arxiv://2504.13730v1"
    },
    {
      "id": "2504.13717v1",
      "title": "Human-aligned Deep Learning: Explainability, Causality, and Biological Inspiration",
      "authors": [
        "Gianluca Carloni"
      ],
      "abstract": "This work aligns deep learning (DL) with human reasoning capabilities and\nneeds to enable more efficient, interpretable, and robust image classification.\nWe approach this from three perspectives: explainability, causality, and\nbiological vision. Introduction and background open this work before diving\ninto operative chapters. First, we assess neural networks' visualization\ntechniques for medical images and validate an explainable-by-design method for\nbreast mass classification. A comprehensive review at the intersection of XAI\nand causality follows, where we introduce a general scaffold to organize past\nand future research, laying the groundwork for our second perspective. In the\ncausality direction, we propose novel modules that exploit feature\nco-occurrence in medical images, leading to more effective and explainable\npredictions. We further introduce CROCODILE, a general framework that\nintegrates causal concepts, contrastive learning, feature disentanglement, and\nprior knowledge to enhance generalization. Lastly, we explore biological\nvision, examining how humans recognize objects, and propose CoCoReco, a\nconnectivity-inspired network with context-aware attention mechanisms. Overall,\nour key findings include: (i) simple activation maximization lacks insight for\nmedical imaging DL models; (ii) prototypical-part learning is effective and\nradiologically aligned; (iii) XAI and causal ML are deeply connected; (iv) weak\ncausal signals can be leveraged without a priori information to improve\nperformance and interpretability; (v) our framework generalizes across medical\ndomains and out-of-distribution data; (vi) incorporating biological circuit\nmotifs improves human-aligned recognition. This work contributes toward\nhuman-aligned DL and highlights pathways to bridge the gap between research and\nclinical adoption, with implications for improved trust, diagnostic accuracy,\nand safe deployment.",
      "categories": [
        "cs.CV",
        "cs.AI",
        "cs.LG",
        "eess.IV",
        "q-bio.NC",
        "I.2; I.2.6; I.4; I.4.7; I.5; J.3; J.6"
      ],
      "published": "2025-04-18T14:40:58+00:00",
      "url": "http://arxiv.org/pdf/2504.13717v1",
      "resource_uri": "arxiv://2504.13717v1"
    },
    {
      "id": "2504.13707v1",
      "title": "OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation",
      "authors": [
        "Yichen Wu",
        "Xudong Pan",
        "Geng Hong",
        "Min Yang"
      ],
      "abstract": "As the general capabilities of large language models (LLMs) improve and agent\napplications become more widespread, the underlying deception risks urgently\nrequire systematic evaluation and effective oversight. Unlike existing\nevaluation which uses simulated games or presents limited choices, we introduce\nOpenDeception, a novel deception evaluation framework with an open-ended\nscenario dataset. OpenDeception jointly evaluates both the deception intention\nand capabilities of LLM-based agents by inspecting their internal reasoning\nprocess. Specifically, we construct five types of common use cases where LLMs\nintensively interact with the user, each consisting of ten diverse, concrete\nscenarios from the real world. To avoid ethical concerns and costs of high-risk\ndeceptive interactions with human testers, we propose to simulate the\nmulti-turn dialogue via agent simulation. Extensive evaluation of eleven\nmainstream LLMs on OpenDeception highlights the urgent need to address\ndeception risks and security concerns in LLM-based agents: the deception\nintention ratio across the models exceeds 80%, while the deception success rate\nsurpasses 50%. Furthermore, we observe that LLMs with stronger capabilities do\nexhibit a higher risk of deception, which calls for more alignment efforts on\ninhibiting deceptive behaviors.",
      "categories": [
        "cs.AI",
        "cs.CL"
      ],
      "published": "2025-04-18T14:11:27+00:00",
      "url": "http://arxiv.org/pdf/2504.13707v1",
      "resource_uri": "arxiv://2504.13707v1"
    }
  ]
}
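I also checked the results of the search restricted to the AI category, but again found no papers directly related to the specific protocol called "Model Context Protocol". The results include a variety of AI research papers, but none that focus on the specific technology or methodology called Model Context Protocol.
The search results include the following research topics: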
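Research on improving LLM reasoning ability through reinforcement learning
Data selection methods for instruction tuning
A survey on cognitive engineering
A survey of parameter-efficient continual learning
Rollout down-sampling methods for LLMs
Algorithms for estimating and testing collision probability
Human demonstrations for imitation learning
Methods for improving the interpretability of vision transformers
An unlearning framework for LLMs
Research on text-to-image generation models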
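If you are looking for papers on a specific technology or methodology called "Model Context Protocol", the term may be relatively new or not in common use. It may also be known by a different name.
Instead, searching with related keywords such as the following may give more relevant results: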
"context window extension"
"context length extension"
"long context modeling"
"transformer context mechanisms"
"model context handling"
If you can tell me more specifically what you are looking for, I may be able to help.
Narrowing the search to the AI category did not work either.
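One way to double-check the assistant's conclusion is to scan the returned entries for the exact phrase yourself. The sketch below is only an illustration: it assumes the paper entries have already been parsed into a list of dicts with the `title` and `abstract` fields seen in the tool output, and the helper names `mentions_phrase` and `filter_papers` as well as the second sample entry are made up for this example.

```python
def mentions_phrase(paper: dict, phrase: str) -> bool:
    """Return True if the phrase appears in the paper's title or abstract."""
    # Abstracts in the tool output contain hard line breaks,
    # so collapse all whitespace before matching.
    raw = f"{paper.get('title', '')} {paper.get('abstract', '')}"
    text = " ".join(raw.split())
    return phrase.lower() in text.lower()


def filter_papers(papers: list[dict], phrase: str) -> list[dict]:
    """Keep only the papers that literally mention the phrase."""
    return [p for p in papers if mentions_phrase(p, phrase)]


# Sample data: the first entry reuses an id and title from the output above
# (abstract truncated for brevity); the second entry is entirely hypothetical.
sample = [
    {
        "id": "2504.13707v1",
        "title": "OpenDeception: Benchmarking and Investigating AI Deceptive "
                 "Behaviors via Open-ended Interaction Simulation",
        "abstract": "As the general capabilities of large language models (LLMs) improve",
    },
    {
        "id": "0000.00000v1",
        "title": "A made-up paper",
        "abstract": "We study the Model Context\nProtocol for tool use.",
    },
]

hits = filter_papers(sample, "Model Context Protocol")
print([p["id"] for p in hits])  # ['0000.00000v1']
```

A literal match like this is stricter than the search itself, so it only confirms whether any returned paper names the protocol outright.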

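With 20 results retrieved, even the oldest paper goes back only to 2025/04/18 23:11:27 JST. The published timestamps run from newest to oldest, which suggests the results are ordered by submission date rather than by relevance.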
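The JST time above comes from the `published` field of the last entry, which is given in UTC. As a small worked example, the conversion can be reproduced with the standard library; the helper name `to_jst` is made up for this sketch.

```python
from datetime import datetime, timedelta, timezone

# JST is UTC+9 with no daylight saving time.
JST = timezone(timedelta(hours=9), name="JST")


def to_jst(published: str) -> str:
    """Convert an ISO 8601 timestamp with a UTC offset to a JST string."""
    utc_time = datetime.fromisoformat(published)
    return utc_time.astimezone(JST).strftime("%Y/%m/%d %H:%M:%S %Z")


# "published" value of the oldest entry in the result above.
print(to_jst("2025-04-18T14:11:27+00:00"))  # 2025/04/18 23:11:27 JST
```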
Summary
I tried searching for papers with arxiv-mcp-server, an MCP server for ArXiv.

Partly because of the sheer number of papers, searching turns out not to be straightforward.
Bonus
Let's look at the definition of the search_papers tool in arxiv-mcp-server.
https://github.com/blazickjp/arxiv-mcp-server/blob/main/src/arxiv_mcp_server/tools/search.py#L13-L27
In this test I only tried the parameters up to max_results and categories.
There are also date_from and date_to, so adjusting the date range little by little might have turned up more results.
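As a rough sketch of that idea, the date range could be split into consecutive windows, with one set of search_papers arguments per window. This was not tried in this article. The helper names, the 30-day window, and the 2025-01-01 start date are all assumptions for illustration; it is also assumed that date_to takes the same YYYY-MM-DD format as date_from in the README example, and whether the server treats the bounds as inclusive has not been checked.

```python
from datetime import date, timedelta


def date_windows(start: date, end: date, days: int):
    """Yield consecutive (date_from, date_to) pairs covering start..end."""
    if days < 1:
        raise ValueError("days must be at least 1")
    current = start
    while current <= end:
        # Clamp the last window so it never runs past the end date.
        window_end = min(current + timedelta(days=days - 1), end)
        yield current, window_end
        current = window_end + timedelta(days=1)


def build_search_args(query, start, end, days, categories, max_results=20):
    """Build one search_papers argument dict per date window."""
    return [
        {
            "query": query,
            "max_results": max_results,
            "date_from": d_from.isoformat(),
            "date_to": d_to.isoformat(),
            "categories": categories,
        }
        for d_from, d_to in date_windows(start, end, days)
    ]


search_requests = build_search_args(
    '"Model Context Protocol"',
    start=date(2025, 1, 1),   # hypothetical start of the period
    end=date(2025, 4, 18),
    days=30,
    categories=["cs.AI"],
)
for args in search_requests:
    print(args["date_from"], args["date_to"])
    # Each dict could then be passed on, e.g. following the README example:
    # result = await call_tool("search_papers", args)
```

Smaller windows mean more calls, but each call's max_results budget is then spent on a narrower slice of time instead of only the newest submissions.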