Beyond Semantic Similarity

Hacker News 44za12 0 views 7 min read
Computer Science > Information Retrieval
arXiv:2605.05242 (cs)
[Submitted on 3 May 2026]
Title: Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction
Authors: Zhuofeng Li, Haoxiang Zhang, Cong Wei, Pan Lu, Ping Nie, Yi Lu, Yuyang Bai, Shangbin Feng, Hangxiao Zhu, Ming Zhong, Yuyu Zhang, Jianwen Xie, Yejin Choi, James Zou, Jiawei Han, Wenhu Chen, Jimmy Lin, Dongfu Jiang, Yu Zhang
Abstract: Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks further exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To address this limitation, we study direct corpus interaction (DCI), where an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus; DCI thereby opens a broader interface-design space for agentic search.
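The interaction pattern the abstract describes (exact lexical matching, conjunctions of weak clues, and local context checks over raw files) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract says agents use terminal tools such as grep and file reads, and the code below is only a standard-library stand-in. The function names (`grep_corpus`, `files_with_all_clues`, `read_context`, `build_toy_corpus`), the `*.txt` file layout, and the toy documents are all assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path


def grep_corpus(root, pattern, flags=re.IGNORECASE):
    """Return (path, line_number, line) for every line under root matching pattern."""
    regex = re.compile(pattern, flags)
    hits = []
    # Assumption: the corpus is a directory tree of plain-text .txt files.
    for path in sorted(Path(root).rglob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as fh:
            for line_no, line in enumerate(fh, start=1):
                if regex.search(line):
                    hits.append((path, line_no, line.rstrip("\n")))
    return hits


def files_with_all_clues(root, clues):
    """Clue conjunction: keep only files in which every literal clue occurs."""
    surviving = None
    for clue in clues:
        matched = {path for path, _, _ in grep_corpus(root, re.escape(clue))}
        surviving = matched if surviving is None else surviving & matched
    return sorted(surviving or [])


def read_context(path, line_no, radius=1):
    """Local context check: return the lines surrounding a 1-indexed hit."""
    lines = Path(path).read_text(encoding="utf-8", errors="replace").splitlines()
    start = max(0, line_no - 1 - radius)
    return lines[start:line_no + radius]


def build_toy_corpus(root):
    """Write three hypothetical documents so the sketch runs end to end."""
    docs = {
        "a.txt": "The bridge opened in 1932.\nIt was designed by a local engineer.\n",
        "b.txt": "A museum opened in 1932.\nIts architect later moved abroad.\n",
        "c.txt": "The bridge was repainted in 1987.\n",
    }
    for name, text in docs.items():
        (Path(root) / name).write_text(text, encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as corpus:
        build_toy_corpus(corpus)
        # Step 1: combine two weak clues; neither alone identifies the document.
        candidates = files_with_all_clues(corpus, ["bridge", "1932"])
        print([p.name for p in candidates])
        # Step 2: inspect the text around each exact hit in the surviving files.
        for path, line_no, _ in grep_corpus(corpus, r"\b1932\b"):
            if path in candidates:
                print(path.name, read_context(path, line_no))
```

Unlike a single top-k call, each step here reads the raw files directly, so no document is discarded before the clues are combined. An agent could then refine the next pattern based on the context it reads back.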
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Cite as: arXiv:2605.05242 [cs.IR]
  (or arXiv:2605.05242v1 [cs.IR] for this version)
  https://doi.org/10.48550/arXiv.2605.05242 (arXiv-issued DOI via DataCite)

Submission history

From: Zhuofeng Li
[v1] Sun, 3 May 2026 19:13:11 UTC (5,193 KB)