scienceaix/agentskills
Awesome Skills for LLMs

A curated collection of resources, papers, tools, and frameworks for building, composing, and deploying skills for large language models — centered on the Anthropic Skills ecosystem and radiating outward to the broader LLM agent capabilities landscape.

Agent Skills were introduced as composable, portable folders of instructions, scripts, and resources that an agent loads dynamically — turning a general-purpose assistant into a specialized agent. This list tracks everything in the skills-for-LLMs ecosystem: the official Anthropic skill system, the closely related Model Context Protocol (MCP), academic research on skill acquisition and tool use, open-source agent frameworks, computer-use and GUI agents, benchmarks, and practical tutorials.


Contents


Anthropic Skills — Core Ecosystem

Official Announcements & Blog Posts

  • Introducing Agent Skills — Official product launch. Skills are folders of instructions, scripts, and resources that Claude loads dynamically. Available for Pro, Max, Team, and Enterprise users. Updated Dec 18, 2025 with organization-wide management and open standard announcement. (Oct 2025, updated Dec 2025)

  • Equipping Agents for the Real World with Agent Skills — Engineering deep-dive on the Agent Skills architecture: progressive disclosure, SKILL.md format, bundled code execution, and best practices. By Barry Zhang, Keith Lazuka, Mahesh Murag. (Oct 2025, updated Dec 2025)

  • Skills for Organizations, Partners, the Ecosystem — Announcement of org-wide skill management, partner-built skills directory, and Skills as an open standard for cross-platform portability. (Dec 18, 2025)

  • How to Create Skills: Key Steps, Limitations, and Examples — Practical guide to building Skills: defining name/description, structuring SKILL.md, writing instructions, testing, and governance best practices for teams. (Nov 2025)

  • How AI Impacts Skill Formation — RCT finding that AI assistance led to 17% lower mastery scores, exploring the tension between productivity and skill development. (Jan 2026)
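
The SKILL.md structure described in the guides above can be sketched concretely. The skill name, description, and helper script path below are hypothetical, and frontmatter fields beyond `name` and `description` vary by platform — treat this as an illustrative layout, not the normative schema:

```markdown
---
name: pdf-form-filler
description: Fills out PDF forms. Use when the user asks to complete,
  fill, or populate fields in a PDF document.
---

# PDF Form Filler

## Instructions
1. List the form's fields with `scripts/list_fields.py <input.pdf>`.
2. Map the user's data onto the field names.
3. Write the filled PDF and report which fields were left blank.
```

Only the `name` and `description` are loaded up front; the body and any bundled scripts are pulled in when the skill is invoked (the "progressive disclosure" pattern described in the engineering deep-dive above).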

Documentation & Guides

GitHub Repositories

Courses & Webinars


Model Context Protocol (MCP)

Specification & Announcements

SDKs & Official Repos

Community MCP Resources

  • punkpeye/awesome-mcp-servers — Comprehensive curated collection of MCP servers: production-ready and experimental, covering file access, databases, API integrations, and more. ⭐ 15k+

  • wong2/awesome-mcp-servers — Curated list of MCP servers with official integrations, reference servers, and community servers by category. ⭐ 10k+

  • microsoft/mcp — Catalog of official Microsoft MCP server implementations including Azure services, DevOps, M365 Agents Toolkit, Fabric, and Sentinel.
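
Under the hood, the servers listed above all speak the same wire format: JSON-RPC 2.0, with MCP-defined methods such as `tools/list` and `tools/call`. A minimal sketch of building a `tools/call` request (the tool name `read_file` and its argument are hypothetical examples, not part of the spec):

```python
import json

def make_tools_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

# Round-trip the message to inspect its shape:
msg = json.loads(make_tools_call(1, "read_file", {"path": "README.md"}))
```

Real clients and servers should use one of the official SDKs rather than hand-rolling messages; the sketch only shows why MCP servers are easy to write in any language with a JSON library.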


Claude Tool Use & Computer Use

Tool Use Resources

  • Introducing Advanced Tool Use on the Claude Developer Platform — Three beta features: Tool Search Tool, Programmatic Tool Calling, and Tool Learning. 85% token reduction; Opus 4.5 improved from 79.5% to 88.1% accuracy. (Nov 24, 2025)

  • Writing Effective Tools for Agents — With Agents — Best practices for designing MCP tools and agent tools: ergonomic design, namespacing, evaluation-driven improvement, using Claude to optimize its own tools. (2025)

  • Building Agents with the Claude Agent SDK — Claude Agent SDK (renamed from Claude Code SDK): tools as primary building blocks, MCP integration, bash tool, code generation patterns. (Sep 29, 2025)

  • Building Effective AI Agents — Foundational guide on agentic system patterns: prompt chaining, routing, parallelization, orchestrator-workers. Distinguishes workflows vs agents. (Dec 2024)

  • Demystifying Evals for AI Agents — Guide to building evaluations for agents including coding agents, research agents, and computer use agents. (2025)

  • Tool Use with Claude — Overview — Main documentation hub: client tools, server tools (web search, text editor, code execution, computer use), structured outputs, MCP integration.

  • How to Implement Tool Use — Step-by-step implementation guide covering tool definitions, tool_use responses, tool_result handling, parallel tool calls, and token-efficient tools.

  • Programmatic Tool Calling — Documentation for the advanced-tool-use-2025-11-20 beta: Claude executes tools programmatically via code execution.
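
The tool_use/tool_result cycle documented above can be sketched as a local dispatch step: the model's response contains `tool_use` content blocks, the client runs the matching functions, and the results go back as `tool_result` blocks in the next user turn. The block shapes below mirror the Messages API docs; the `get_weather` tool and its output are hypothetical:

```python
def dispatch_tool_calls(content_blocks, tools):
    """Run each tool_use block against a local function table and build
    the tool_result blocks to send back in the next user message."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue  # skip plain text blocks
        fn = tools[block["name"]]
        output = fn(**block["input"])
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],  # pairs result with its call
            "content": str(output),
        })
    return results

# Hypothetical local tool and a mocked assistant response:
tools = {"get_weather": lambda city: f"18°C and clear in {city}"}
response_content = [
    {"type": "text", "text": "Checking the weather."},
    {"type": "tool_use", "id": "toolu_01", "name": "get_weather",
     "input": {"city": "Paris"}},
]
results = dispatch_tool_calls(response_content, tools)
```

With parallel tool calls, the response simply carries several `tool_use` blocks; the same loop handles them all in one pass.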

Computer Use Resources

  • Developing a Computer Use Model — Research insights on training Claude for computer use: generalization from simple software, safety considerations, and RSP assessment. (Aug 2025)

  • Monitoring Computer Use via Hierarchical Summarization — Safety research on monitoring Computer Use API activity using hierarchical summarization to detect harmful behaviors at scale. (2025)

  • Computer Use Tool Documentation — Official API docs for computer use: computer_20251124 (Opus 4.5/4.6), computer_20250124 (Sonnet 4.5), screenshot capture, mouse/keyboard control.

  • API Release Notes — Changelog documenting computer_20250124 tool version (Jan 2025), bash_20250124, text_editor_20250124, token-efficient tool use.
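
The computer use tool's agent loop boils down to routing model-emitted action dicts to a GUI backend. The action names below (`screenshot`, `left_click`, `type`) follow the documented tool schema; the backend interface and `FakeBackend` are hypothetical stand-ins for something like pyautogui in a real agent:

```python
def execute_action(action: dict, backend) -> str:
    """Route one computer-use action dict to a GUI backend and return
    a short result string for the tool_result message."""
    kind = action["action"]
    if kind == "screenshot":
        return backend.screenshot()
    if kind == "left_click":
        x, y = action["coordinate"]
        backend.click(x, y)
        return f"clicked ({x}, {y})"
    if kind == "type":
        backend.type_text(action["text"])
        return f"typed {len(action['text'])} chars"
    raise ValueError(f"unsupported action: {kind}")

class FakeBackend:
    """No-op backend so the loop can run without a display."""
    def screenshot(self): return "<png bytes>"
    def click(self, x, y): pass
    def type_text(self, s): pass

log = [execute_action(a, FakeBackend()) for a in (
    {"action": "left_click", "coordinate": [640, 400]},
    {"action": "type", "text": "hello"},
)]
```

After each action, the real loop captures a fresh screenshot and returns it as the tool result so the model can observe the new screen state.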


Academic Papers

Skill Learning & Composition

  • SAGE: Reinforcement Learning for Self-Improving Agent with Skill Library — Jiongxiao Wang et al. Proposes SAGE (Skill Augmented GRPO for self-Evolution), an RL framework that systematically incorporates a skill library into agent training. Achieves 8.9% higher task completion while requiring 26% fewer steps and 59% fewer tokens on AppWorld. (Dec 2025)

  • CUA-Skill: Develop Skills for Computer Using Agent — Tianyi Chen et al. (Microsoft). Skill-centric framework encoding human computer-use knowledge as reusable, parameterized skills with execution and composition graphs. CUA-Skill Agent achieves SOTA 57.5% success rate on WindowsAgentArena. (Jan 2026)

  • Agentic Proposing: Enhancing LLM Reasoning via Compositional Skill Synthesis — Zhengbo Jiao et al. Models problem synthesis as a goal-driven process with a specialized agent that dynamically selects and composes modular reasoning skills from a skill library. A 30B solver achieves 91.6% on AIME 2025. (Feb 2026)

  • When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail — Xiaoxiao Li. Investigates "compiling" multi-agent systems into single-agent skill libraries, finding substantial reductions in token usage and latency while maintaining accuracy. Discovers a phase transition in skill selection accuracy at a critical library size. (Jan 2026)

  • Self-Distillation Enables Continual Learning — Idan Shenfeld, Mehul Damani et al. Studies whether pretrained LLMs can acquire new, narrowly defined skills (science Q&A, tool use, medical reasoning) without degrading existing abilities, using self-distillation. (Jan 2026)
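
A common thread in the papers above is selecting a skill from a library by matching the task against stored skill descriptions. A toy bag-of-words version of that selection step (the skill names and descriptions are hypothetical; real systems use embeddings, or load the descriptions into context and let the model choose, as in progressive disclosure):

```python
def select_skill(task: str, library: dict) -> str:
    """Pick the library skill whose description shares the most words
    with the task. Deliberately naive, for illustration only."""
    task_words = set(task.lower().split())
    def score(name: str) -> int:
        return len(task_words & set(library[name].lower().split()))
    return max(library, key=score)

library = {  # hypothetical skill names and descriptions
    "pdf-forms": "fill out and flatten pdf form fields",
    "sql-report": "run sql queries and build a summary report",
}
best = select_skill("please fill this pdf form", library)
```

The phase-transition result above (single-agent skill libraries) suggests this selection step is the bottleneck: accuracy degrades sharply once the library grows past a critical size, whatever the scoring function.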

Tool Use & Function Calling

Computer Use & GUI Agents

Web Agents & Browser Automation

GUI Grounding & Visual Understanding

OS Agents

Multi-Agent Collaboration

Surveys & Overviews


Open-Source Projects & Frameworks

Agent Frameworks

| Project | Description | Stars |
| --- | --- | --- |
| LangGraph | Graph-based framework for stateful, multi-actor LLM applications with cyclic workflows. Extends LangChain with coordination of multiple chains and agents. | ⭐ 80k+ |
| Microsoft AutoGen | Event-driven multi-agent conversation framework with customizable agent behaviors and structured conversation flows. | ⭐ 40k+ |
| CrewAI | Orchestrates role-playing autonomous AI agents that collaborate to accomplish complex tasks with specific roles, goals, and tools. | ⭐ 43k+ |
| Microsoft Semantic Kernel | Model-agnostic SDK for building and deploying AI agents and multi-agent systems with plugin ecosystems and MCP integration. | ⭐ 22k+ |
| Agno (formerly Phidata) | High-performance agent framework with multimodal support. Claims ~10,000× faster agent instantiation than LangGraph. | ⭐ 20k+ |
| Mastra | TypeScript AI agent framework with assistants, RAG, and observability. Built-in tools library. | ⭐ 21k+ |
| PydanticAI | Schema-driven, type-safe GenAI agent framework by the Pydantic team. Supports MCP/A2A, durable execution. | ⭐ 10k+ |
| HuggingFace smolagents | Minimalist, code-first agent library (~1,000 lines core). Code agents write actions as Python; Hub integration for sharing. | ⭐ 10k+ |
| OpenAI Agents SDK | Lightweight Python framework for multi-agent workflows with tracing, guardrails, and handoffs. Provider-agnostic. | ⭐ 9k+ |
| Google Agent Development Kit (ADK) | Open-source, code-first Python toolkit optimized for Gemini but model-agnostic. Supports A2A protocol. | ⭐ 5k+ |

Browser Automation & Computer Use Agents

| Project | Description | Stars |
| --- | --- | --- |
| browser-use | Makes websites accessible for AI agents with automated browser interaction. Multi-provider support (OpenAI, Google, Anthropic, Ollama). | ⭐ 50k+ |
| Skyvern | Browser-based workflow automation using LLMs and computer vision. SOTA on WebBench (64.4%). Multi-agent with planner + validator. | ⭐ 20k+ |
| Stagehand (Browserbase) | AI browser automation combining natural language and code with act(), agent(), and extract() APIs. Auto-caching with self-healing. | ⭐ 10k+ |
| Steel Browser | Open-source browser API for AI agents with full Puppeteer/Playwright/Selenium support, session management, and proxy support. | ⭐ 5k+ |
| agent-browser (Vercel Labs) | Headless browser automation CLI for AI agents. Fast Rust CLI. Works with Claude Code, Cursor, Gemini CLI, and Codex. | ⭐ 3k+ |

Coding Agents

| Project | Description | Stars |
| --- | --- | --- |
| OpenHands (formerly OpenDevin) | Open platform for AI-driven development with SDK, CLI, GUI, and cloud deployment. SOTA on SWE-bench Verified. | ⭐ 65k+ |
| Aider | AI pair programming in the terminal. Maps the entire codebase for context-aware multi-file edits. Works with Claude, GPT, DeepSeek. | ⭐ 40k+ |
| SWE-agent | Automatically fixes GitHub issues using your LM of choice. SWE-agent 1.0 + Claude 3.7 achieved SOTA on SWE-bench. | ⭐ 10k+ |
| mini-swe-agent | 100-line AI agent scoring >74% on SWE-bench Verified. Radically simple — no tools other than bash. | ⭐ 2k+ |

Benchmarks & Evaluation

Agent Benchmarks

  • SWE-bench & SWE-bench Verified — The premier benchmark for LLMs on real-world GitHub issue resolution. 500 human-validated tasks. Top model (Claude Opus 4.6 Thinking) reaches 79.2% on Verified. (GitHub)

  • SWE-bench-Live — Microsoft's live, monthly-updated variant adding 50 newly verified issues per month to prevent data contamination. Multi-language (C, C++, C#, Python, Java, Go, JS/TS, Rust). (NeurIPS 2025)

  • SWE-bench Pro — Scale AI's harder variant using copyleft and proprietary codebases. Best models score only ~23% vs 70%+ on Verified.

  • GAIA: General AI Assistants — 466 real-world questions requiring reasoning, multi-modality, web browsing, and tool-use. Humans: 92%; GPT-4: 15%. (Leaderboard)

  • TheAgentCompany — CMU benchmark simulating a software engineering startup with 175 professional tasks across browsing, coding, terminal use, and communication. Best agent: ~24.4% full completion. (NeurIPS 2025 D&B)

  • AgentBench — Systematic benchmark evaluating LLMs as agents across 8 environments (OS, database, knowledge graph, card game, web shopping, web browsing). (Paper)

  • DABstep: Data Agent Benchmark — By Adyen and Hugging Face, 450+ real-world data analysis tasks. Best reasoning agents achieve only 16% accuracy. (2025)

Tool Use & Function Calling Benchmarks

Computer Use & GUI Benchmarks

  • OSWorld / OSWorld-Verified — First real-computer-environment benchmark for multimodal agents across Ubuntu, Windows, macOS. Human: 72.36%; first agent surpassed this at 72.6% in Dec 2025. (GitHub, Verified Blog)

  • ScreenSpot-Pro — GUI grounding in professional high-resolution settings. 23 applications, 5 industries, 3 OSes. Expert annotations. (Apr 2025)

  • AndroidWorld — Dynamic benchmarking for autonomous mobile GUI agents on real Android systems. 116 core tasks across 20 apps. Leading agents: 60-80%.

  • GUI-World — Video benchmark evaluating multimodal LLMs on dynamic GUI understanding across 6 scenarios. (ICLR 2025 Poster) (OpenReview)

  • OSUniverse — Benchmark for multimodal desktop agents focusing on realistic office worker skills. Humans: 72.36%; leading models: ~42.9%. (May 2025)

Web Browsing Benchmarks

  • WebArena — Self-hosted web environment for autonomous agents with 812 tasks across e-commerce, forums, code dev, CMS. Execution-based evaluation. (ICLR 2024) (GitHub)

  • VisualWebArena — 910 visually grounded web tasks requiring multimodal comprehension across Classifieds, Shopping, and Reddit environments. (ACL 2024) (GitHub)

  • BrowseComp — OpenAI's 1,266 challenging questions requiring persistent, creative web navigation. Deep Research: ~51.5% vs <10% for non-agentic models. (Paper) (Apr 2025)

  • BrowseComp-Plus — Fair evaluation for Deep-Research agents using fixed curated corpus of ~100K human-verified documents. (Paper) (Aug 2025)

  • Mind2Web / Online-Mind2Web — 2,350 tasks across 137 real websites. Online-Mind2Web extends with WebJudge auto-evaluation reaching ~85% agreement with human judgment.

Leaderboards & Aggregators


Tutorials & Educational Resources

Anthropic Official Tutorials

  • Building Effective AI Agents — Foundational guide on agentic patterns: prompt chaining, routing, parallelization, orchestrator-workers. The canonical reference for agent architectures. (Dec 2024)

  • Tool Use with Claude — Documentation — Complete official docs for Claude's tool use API: client tools, server tools, MCP integration, parallel calls, and pricing.

  • Computer Use — Documentation — Official docs for Computer Use: screenshot analysis, mouse/keyboard control, bash and text editor tools, agent loop implementation, and Docker reference.

  • Anthropic Cookbook — Agent Patterns — Reference implementations of common agent workflow patterns from the "Building Effective Agents" post.

  • Demystifying Evals for AI Agents — Practical guide to building evaluations for coding agents, research agents, and computer use agents.
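
The simplest of the agentic patterns covered in the tutorials above, prompt chaining, can be sketched in a few lines: a fixed sequence of LLM calls where each step's output becomes the next step's input. The `stub_llm` below is a hypothetical stand-in for a real model call, used only to show the data flow:

```python
def chain(steps, llm, task: str) -> str:
    """Prompt chaining: run fixed sequential steps, each call
    consuming the previous step's output. `llm` is any callable
    mapping a prompt string to a text response."""
    output = task
    for instruction in steps:
        output = llm(f"{instruction}\n\nInput:\n{output}")
    return output

def stub_llm(prompt: str) -> str:
    # Echo which instruction ran, so the chaining is visible.
    step = prompt.split("\n")[0]
    return f"[{step}] done"

result = chain(
    ["Outline the post", "Draft each section", "Polish the prose"],
    stub_llm,
    "Write about agent skills",
)
```

Routing, parallelization, and orchestrator-workers are variations on the same idea: they change how the next call is chosen or how many run at once, not the call interface itself.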

Community Skills Guides

MCP Tutorials

Agent Architecture Guides

Tool Use & Function Calling Tutorials

Courses

  • Agentic AI — DeepLearning.AI — Andrew Ng's flagship course covering four agentic design patterns: Reflection, Tool Use, Planning, and Multi-Agent collaboration. Vendor-neutral, built with raw Python. (Oct 2025)

  • AI Agents in LangGraph — DeepLearning.AI — Building controllable agents with LangGraph, including agentic search integration. (Harrison Chase & Rotem Weiss)

  • HuggingFace Evaluation Guidebook — Comprehensive evaluation resource covering 2025 evaluations for useful models, including agentic benchmarks analysis.


Related Awesome Lists


Key Timeline

| Date | Milestone |
| --- | --- |
| Nov 2024 | MCP launched as open standard; Claude Computer Use enters public beta |
| Dec 2024 | "Building Effective Agents" foundational guide published |
| Jan 2025 | Updated computer use tools (computer_20250124); token-efficient tool use |
| Mar 2025 | OpenAI adopts MCP; OpenAI Agents SDK released |
| Sep 2025 | Claude Agent SDK announced; MCP Registry preview launch |
| Oct 2025 | Agent Skills launched for Claude apps, Claude Code, and API |
| Nov 2025 | MCP spec 2025-11-25; Advanced Tool Use features; Opus 4.5 launch |
| Dec 2025 | Skills published as open standard; MCP donated to Agentic AI Foundation (Linux Foundation) |
| Jan–Feb 2026 | Skills repo reaches 62k+ stars; ongoing ecosystem expansion |

Contributing

Contributions welcome! Please read the contribution guidelines first. All submitted links must be verified and genuinely related to Skills for LLMs.


Citation

```bibtex
@misc{xu2026agentskillslargelanguage,
      title={Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward},
      author={Renjun Xu and Yang Yan},
      year={2026},
      eprint={2602.12430},
      archivePrefix={arXiv},
      primaryClass={cs.MA},
      url={https://arxiv.org/abs/2602.12430},
}
```
