| Category | Title | Link |
| --- | --- | --- |
| AI Agent Frameworks & Development | AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents | https://arxiv.org/pdf/2502.05957 |
| AI Agent Frameworks & Development | Building effective agents | https://www.anthropic.com/engineering/building-effective-agents |
| AI Agent Frameworks & Development | OpenAgents: An Open Platform for Language Agents in the Wild | https://arxiv.org/pdf/2310.10634 |
| AI Agent Frameworks & Development | Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research | https://arxiv.org/pdf/2502.04644 |
| AI Agent Frameworks & Development | AutoGLM: Autonomous Foundation Agents for GUIs | https://arxiv.org/pdf/2411.00820 |
| AI Agent Frameworks & Development | TapeAgents: A Holistic Framework for Agent Development and Optimization | https://arxiv.org/pdf/2412.08445 |
| AI Agent Frameworks & Development | How to think about agent frameworks | https://blog.langchain.dev/how-to-think-about-agent-frameworks/ |
| AI for Scientific Research | Towards an AI Co-Scientist | https://storage.googleapis.com/coscientist_paper/ai_coscientist.pdf |
| AI for Scientific Research | DeepResearcher: Scaling Deep Research via Reinforcement Learning | https://arxiv.org/pdf/2504.03160 |
| AI for Scientific Research | AI Achieves Silver-Medal Standard Solving IMO Problems | https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/ |
| AI for Scientific Research | Accelerating Scientific Research Through Multi-LLM Frameworks | https://arxiv.org/pdf/2502.07960 |
| AI for Scientific Research | The AI Scientist: Fully Automated Open-Ended Scientific Discovery | https://arxiv.org/pdf/2408.06292 |
| AI for Scientific Research | Transforming Science with LLMs: Survey on AI-Assisted Discovery | https://arxiv.org/pdf/2502.05151 |
| AI for Scientific Research | AI's Deep Research Revolution in Biomedical Literature | https://journals.lww.com/jcma/citation/9900/ai_s_deep_research_revolution__transforming.508.aspx |
| AI for Scientific Research | Unlocking AI Researchers' Potential in Scientific Discovery | https://arxiv.org/pdf/2503.05822 |
| AI for Scientific Research | Empowering Biomedical Discovery with AI Agents | https://arxiv.org/pdf/2404.02831 |
| AI for Scientific Research | Automated Scientific Discovery Systems | https://arxiv.org/abs/2305.02251 |
| LLM Tool Integration & API Control | ToolLLM: Mastering 16K+ Real-World APIs | https://arxiv.org/pdf/2307.16789 |
| LLM Tool Integration & API Control | MetaGPT: Multi-Agent Collaborative Framework | https://arxiv.org/pdf/2308.00352 |
| LLM Tool Integration & API Control | AutoGen: Next-Gen LLM Apps via Multi-Agent Conversation | https://arxiv.org/pdf/2308.08155 |
| LLM Tool Integration & API Control | LLaVA-Plus: Creating Multimodal Agents with Tools | https://arxiv.org/pdf/2311.05437 |
| LLM Tool Integration & API Control | ChemCrow: Augmenting LLMs with Chemistry Tools | https://arxiv.org/pdf/2304.05376 |
| LLM Tool Integration & API Control | TORL: Scaling Tool-Integrated Reinforcement Learning | https://arxiv.org/pdf/2503.23383 |
| Deep Research Systems | OpenAI's 'Deep Research' Tool: Usefulness for Scientists | https://www.nature.com/articles/d41586-025-00377-9 |
| Deep Research Systems | OpenAI's Deep Research: Functionality and Applications | https://www.youreverydayai.com/openais-deep-research-how-it-works-and-what-to-use-it-for/ |
| Deep Research Systems | Deep Research System Card | https://cdn.openai.com/deep-research-system-card.pdf |
| Deep Research Systems | Gemini Launches Deep Research on Gemini 2.5 Pro | https://www.ctol.digital/news/gemini-deep-research-launch-2-5-pro-vs-openai/ |
| Deep Research Systems | Deep Research Now Available on Gemini 2.5 Pro Experimental | https://blog.google/products/gemini/deep-research-gemini-2-5-pro-experimental/ |
| Deep Research Systems | ChatGPT's Deep Research vs. Google's Gemini 1.5 Pro: Comparison | https://whitebeardstrategies.com/ai-prompt-engineering/chatgpts-deep-research-vs-googles-gemini-1-5-pro-with-deep-research-a-detailed-comparison/ |
| Deep Research Systems | ChatGPT Deep Research vs Perplexity: Comparative Analysis | https://blog.getbind.co/2025/02/03/chatgpt-deep-research-is-it-better-than-perplexity/ |
| Deep Research Systems | Sonar by Perplexity [Technical Documentation] | https://docs.perplexity.ai/guides/model-cards#research-models |
| RAG Technology | Ragnarök: Reusable RAG Framework for TREC 2024 | http://arxiv.org/pdf/2406.16828 |
| RAG Technology | From Documents to Dialogue: KG-RAG Enhanced AI Assistants | https://arxiv.org/pdf/2502.15237 |
| RAG Technology | GEAR-Up: AI-Augmented Scholarly Search for Systematic Reviews | https://arxiv.org/pdf/2312.09948 |
| RAG Technology | Survey on RAG for Large Language Models | https://arxiv.org/pdf/2405.06211 |
| RAG Technology | Knowledge Retrieval Based on Generative AI | https://arxiv.org/pdf/2501.04635 |
| LLM Reasoning & Optimization | Self-Consistency Improves Chain-of-Thought Reasoning | https://arxiv.org/pdf/2203.11171 |
| LLM Reasoning & Optimization | Chain-of-Thought Prompting Elicits Reasoning in LLMs | https://arxiv.org/pdf/2201.11903 |
| LLM Reasoning & Optimization | Training LLMs to Follow Instructions with Human Feedback | https://arxiv.org/pdf/2203.02155 |
| LLM Reasoning & Optimization | Debate Enhances Weak-to-Strong Generalization | https://arxiv.org/pdf/2501.13124 |
| LLM Reasoning & Optimization | Mask-DPO: Factuality Alignment for LLMs | https://arxiv.org/pdf/2503.02846 |
| LLM Reasoning & Optimization | QuestBench: Can LLMs Ask Optimal Questions? | https://arxiv.org/abs/2503.22674 |
| Multi-Agent Systems | AgentVerse: Multi-Agent Collaboration and Emergent Behaviors | https://arxiv.org/pdf/2308.10848 |
| Multi-Agent Systems | MetaAgents: Human Behavior Simulation for Task Coordination | https://arxiv.org/pdf/2310.06500 |
| Multi-Agent Systems | CAMEL: Communicative Agents for LLM Society Exploration | https://arxiv.org/pdf/2303.17760 |
| Multi-Agent Systems | Many Heads Improve Scientific Idea Generation | https://arxiv.org/pdf/2410.09403 |
| Multi-Agent Systems | Why Multi-Agent LLM Systems Fail | https://arxiv.org/pdf/2503.13657 |
| Multi-Agent Systems | Multi-Agent System for Cosmological Parameter Analysis | https://arxiv.org/pdf/2412.00431 |
| Code & Software Development | CodeA11y: Accessible Web Development with AI | https://arxiv.org/pdf/2502.10884 |
| Code & Software Development | AutoDev: Automated AI-Driven Development | https://arxiv.org/pdf/2403.08299 |
| Code & Software Development | ChatDev: Communicative Agents for Software Development | https://aclanthology.org/2024.acl-long.810.pdf |
| Code & Software Development | Natural Language as a Programming Language | https://drops.dagstuhl.de/storage/00lipics/lipics-vol071-snapl2017/LIPIcs.SNAPL.2017.4/LIPIcs.SNAPL.2017.4.pdf |
| Code & Software Development | AIDE: AI-Driven Code Exploration | https://arxiv.org/pdf/2502.13138 |
| Code & Software Development | AI-Assisted Programming: Big Code NLP | https://arxiv.org/pdf/2307.02503 |
| Code & Software Development | AI-Assisted SQL Authoring at Industry Scale | https://arxiv.org/pdf/2407.13280 |
| Code & Software Development | Steward: Natural Language Web Automation | https://arxiv.org/pdf/2409.15441 |
| Domain-Specific AI Tools | MatPilot: AI Materials Scientist | https://arxiv.org/pdf/2411.08063 |
| Domain-Specific AI Tools | EvoPat: Multi-LLM Patent Summarization Agent | https://arxiv.org/pdf/2412.18100 |
| Domain-Specific AI Tools | ChartCitor: Fine-Grained Chart Attribution Framework | https://arxiv.org/pdf/2502.00989 |
| Domain-Specific AI Tools | PatentGPT: Knowledge-Based Patent Drafting | https://arxiv.org/pdf/2409.00092 |
| Domain-Specific AI Tools | SciAgents: Multi-Agent Scientific Discovery | https://arxiv.org/pdf/2409.05556 |
| Domain-Specific AI Tools | Dolphin: Closed-Loop Open-Ended Auto-Research | https://arxiv.org/pdf/2501.03916 |
| Domain-Specific AI Tools | SeqMate: Automating RNA Sequencing with LLMs | https://arxiv.org/pdf/2407.03381 |
| Domain-Specific AI Tools | Knowledge Synthesis of Photosynthesis via LLMs | https://arxiv.org/pdf/2502.01059 |
| Domain-Specific AI Tools | GeoLLM: Geospatial Knowledge Extraction from LLMs | https://arxiv.org/pdf/2310.06213 |
| HCI & AI User Experience | System Usability Scale: Evolution and Future | https://doi.org/10.1080/10447318.2018.1455307 |
| HCI & AI User Experience | CARE: Collaborative AI Reading Environment | https://arxiv.org/pdf/2302.12611 |
| HCI & AI User Experience | VISAR: Visual Argumentative Writing Assistant | https://arxiv.org/pdf/2304.07810 |
| HCI & AI User Experience | AdaptoML-UX: User-Centered AutoML Toolkit | https://arxiv.org/pdf/2410.17469 |
| HCI & AI User Experience | AI Assistants for Semi-Automated Data Wrangling | https://arxiv.org/pdf/2211.00192 |
| HCI & AI User Experience | Documentation Matters: Human-Centered AI Systems | https://arxiv.org/pdf/2102.12592 |
| HCI & AI User Experience | Need Help? Proactive Programming Assistants | https://arxiv.org/abs/2410.04596 |
| HCI & AI User Experience | Large-Scale Survey on AI Programming Assistant Usability | https://arxiv.org/abs/2303.17125 |
| AI Evaluation & Benchmarking | TruthfulQA: Measuring Model Mimicry of Human Falsehoods | https://arxiv.org/pdf/2109.07958 |
| AI Evaluation & Benchmarking | HotpotQA: Dataset for Multi-hop Question Answering | https://arxiv.org/pdf/1809.09600 |
| AI Evaluation & Benchmarking | WebArena: Web Agent Benchmark | https://github.com/web-arena-x/webarena |
| AI Evaluation & Benchmarking | Measuring Short-Form Factuality in LLMs | https://cdn.openai.com/papers/simpleqa.pdf |
| AI Evaluation & Benchmarking | Survey on LLM-Generated Text Detection | https://arxiv.org/pdf/2310.14724 |
| AI Evaluation & Benchmarking | Evaluating AI-Assisted Code Generation Tools | https://arxiv.org/pdf/2304.10778 |
| AI Evaluation & Benchmarking | Benchmarking ChatGPT, Codeium, and GitHub Copilot | https://arxiv.org/pdf/2409.19922 |
| AI Evaluation & Benchmarking | FinEval: Chinese Financial Knowledge Benchmark | https://arxiv.org/pdf/2308.09975 |
| AI Evaluation & Benchmarking | Knowledge-Based Evaluation Methodology for AI Assistants | https://arxiv.org/pdf/2406.05603 |
| AI Evaluation & Benchmarking | GRADE Guidelines: Rating Evidence Quality | https://pubmed.ncbi.nlm.nih.gov/21208779/ |
| AI Evaluation & Benchmarking | Holistic Evaluation of Language Models | https://arxiv.org/pdf/2211.09110 |
| AI Evaluation & Benchmarking | AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models | https://arxiv.org/pdf/2304.06364 |
| AI Evaluation & Benchmarking | GAIA: A Benchmark for General AI Assistants | https://arxiv.org/pdf/2311.12983 |
| AI Evaluation & Benchmarking | MMLU Benchmark: Testing LLMs' Multi-Task Capabilities | https://www.bracai.eu/post/mmlu-benchmark |
| AI Evaluation & Benchmarking | Enabling AI Scientists to Recognize Innovation: A Domain-Agnostic Algorithm for Assessing Novelty | https://arxiv.org/pdf/2503.01508 |
| AI Evaluation & Benchmarking | The impact of AI and peer feedback on research writing skills: a study using the CGScholar platform among Kazakhstani scholars | https://arxiv.org/pdf/2503.05820 |
| AI Evaluation & Benchmarking | Supporting the development of Machine Learning for fundamental science in a federated Cloud with the AI_INFN platform | https://arxiv.org/pdf/2502.21266 |
| AI Evaluation & Benchmarking | EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants | https://arxiv.org/pdf/2502.20309 |
| AI Evaluation & Benchmarking | Bridging Logic Programming and Deep Learning for Explainability through ILASP | https://arxiv.org/pdf/2502.09227 |
| AI Evaluation & Benchmarking | Self-Explanation in Social AI Agents | https://arxiv.org/pdf/2501.13945 |
| AI Evaluation & Benchmarking | Fine-Grained Appropriate Reliance: Human-AI Collaboration with a Multi-Step Transparent Decision Workflow for Complex Task Decomposition | https://arxiv.org/pdf/2501.10909 |
| AI Evaluation & Benchmarking | CATER: Leveraging LLM to Pioneer a Multidimensional, Reference-Independent Paradigm in Translation Quality Evaluation | https://arxiv.org/pdf/2412.11261 |
| AI Evaluation & Benchmarking | GigaCheck: Detecting LLM-generated Content | https://arxiv.org/pdf/2410.23728 |
| AI Evaluation & Benchmarking | Vital Insight: Assisting Experts' Context-Driven Sensemaking of Multi-modal Personal Tracking Data Using Visualization and Human-In-The-Loop LLM Agents | https://arxiv.org/pdf/2410.14879 |
| AI Evaluation & Benchmarking | Aligning AI-driven discovery with human intuition | https://arxiv.org/pdf/2410.07 |
| AI Evaluation & Benchmarking | Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics | https://arxiv.org/pdf/2502.15815 |
| AI Evaluation & Benchmarking | Insect-Foundation: A Foundation Model and Large Multimodal Dataset for Vision-Language Insect Understanding | https://arxiv.org/pdf/2502.09906 |
| AI Evaluation & Benchmarking | Minerva: A Programmable Memory Test Benchmark for Language Models | https://arxiv.org/pdf/2502.03358 |
| AI Evaluation & Benchmarking | UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models | https://arxiv.org/pdf/2502.00334 |
| AI Evaluation & Benchmarking | Learning to Coordinate with Experts | https://arxiv.org/pdf/2502.09583 |
| AI Evaluation & Benchmarking | Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs | https://arxiv.org/pdf/2502.15224 |
| AI Evaluation & Benchmarking | How Well Do LLMs Generate Code for Different Application Domains? Benchmark and Evaluation | https://arxiv.org/pdf/2412.18573 |
| AI Evaluation & Benchmarking | LLM4DS: Evaluating Large Language Models for Data Science Code Generation | https://arxiv.org/pdf/2411.11908 |
| AI Evaluation & Benchmarking | RedCode: Risky Code Execution and Generation Benchmark for Code Agents | https://arxiv.org/pdf/2411.07781 |
| AI Evaluation & Benchmarking | SeafloorAI: A Large-scale Vision-Language Dataset for Seafloor Geological Survey | https://arxiv.org/pdf/2411.00172 |
| AI Evaluation & Benchmarking | INQUIRE: A Natural World Text-to-Image Retrieval Benchmark | https://arxiv.org/pdf/2411.02537 |
| AI Evaluation & Benchmarking | AAAR-1.0: Assessing AI's Potential to Assist Research | https://arxiv.org/pdf/2410.22394 |
| AI Evaluation & Benchmarking | AutoPenBench: Benchmarking Generative Agents for Penetration Testing | https://arxiv.org/pdf/2410.03225 |
| AI Evaluation & Benchmarking | CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs | https://arxiv.org/pdf/2410.01999 |
| AI Evaluation & Benchmarking | UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs | https://arxiv.org/pdf/2409.19898 |
| AI Evaluation & Benchmarking | CI-Bench: Benchmarking Contextual Integrity of AI Assistants on Synthetic Data | https://arxiv.org/pdf/2409.13903 |
| AI Evaluation & Benchmarking | ChemDFM-X: Towards Large Multimodal Model for Chemistry | https://arxiv.org/pdf/2409.13194 |
| AI Evaluation & Benchmarking | DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? | https://arxiv.org/pdf/2409.07703 |
| AI Evaluation & Benchmarking | GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI | https://arxiv.org/pdf/2408.03361 |
| AI Evaluation & Benchmarking | MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding | https://arxiv.org/pdf/2407.04903 |
| AI Evaluation & Benchmarking | SciCode: A Research Coding Benchmark Curated by Scientists | https://arxiv.org/pdf/2407.13168 |
| AI Evaluation & Benchmarking | MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows | https://arxiv.org/pdf/2406.06357 |
| AI Evaluation & Benchmarking | Turing Tests For An AI Scientist | https://arxiv.org/pdf/2405.13352 |
| AI Evaluation & Benchmarking | LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model | https://arxiv.org/pdf/2402.02544 |
| AI Evaluation & Benchmarking | OceanGPT: A Large Language Model for Ocean Science Tasks | https://arxiv.org/pdf/2310.02031 |
| AI Evaluation & Benchmarking | LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles | https://arxiv.org/pdf/2308.10855 |
| AI Evaluation & Benchmarking | BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents | https://arxiv.org/pdf/2308.05960 |
| AI Evaluation & Benchmarking | MegaWika: Millions of reports and their sources across 50 diverse languages | https://arxiv.org/pdf/2307.07049 |
| AI Evaluation & Benchmarking | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | https://arxiv.org/pdf/2209.09513 |
| AI Evaluation & Benchmarking | Benchmarking Agentic Workflow Generation | https://arxiv.org/abs/2410.07869 |
| AI Evaluation & Benchmarking | TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks | https://arxiv.org/abs/2412.14161 |