Benchmarks
Shuai Yuan edited this page May 28, 2024 · 7 revisions
- NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts
- DebugBench: Evaluating Debugging Capability of Large Language Models
- VulBench: How Far Have We Gone in Vulnerability Detection Using Large Language Models
- InstructCoder: Empowering Language Models for Code Editing
- EvalGPTFix: A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues
- CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation
- CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models
- VerilogEval: Evaluating Large Language Models for Verilog Code Generation
- G-TransEval: On the Evaluation of Neural Code Translation: Taxonomy and Benchmark
- HumanEvalPack: OctoPack: Instruction Tuning Code Large Language Models
- LogHub: A Large-scale Benchmark for Log Parsing Techniques: How Far Are We
- BioCoder: A Benchmark for Bioinformatics Code Generation with Contextual Pragmatic Knowledge
- ExGroFi: Delving into Commit-Issue Correlation to Enhance Commit Message Generation Models
- CoRec: Context-aware Retrieval-based Deep Commit Message Generation
- COJ2022: ErrorCLR: Semantic Error Classification, Localization and Repair for Introductory Programming Assignments
- VulnPatchPairs: Limits of Machine Learning for Automatic Vulnerability Detection
- DotPrompts: Guiding Language Models of Code with Global Context using Monitors
- LongCoder: A Long-Range Pre-trained Language Model for Code Completion
- StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code
- EvalPlus: Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
- DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection
- RunBugRun: An Executable Dataset for Automated Program Repair
- HumanEval-X: CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X
- xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
- ODEX: Execution-Based Evaluation for Open-Domain Code Generation
- ARCADE: Natural Language to Code Generation in Interactive Data Science Notebooks
- DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
- ExeDS: Execution-based Evaluation for Data Science Code Generation Models
- TorchDataEval: When Language Model Meets Private Library
- MBXP, Multilingual HumanEval, MathQA-X: Multi-lingual Evaluation of Code Generation Models
- MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation
- AixBench: A Code Generation Benchmark Dataset
- PandasEval, NumpyEval: CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation
- BIG-Bench: Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
- CoST: Multilingual Code Snippets Training for Program Translation
- MTPB: CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis
- CodeContests: Competition-Level Code Generation with AlphaCode
- DSP: Training and Evaluating a Jupyter Notebook Data Science Assistant
- MBPP: Program Synthesis with Large Language Models
- PlotCoder: Hierarchical Decoding for Synthesizing Visualization Code in Programmatic Context
- CrossVul: a cross-language vulnerability dataset with commit data
- HumanEval: Evaluating Large Language Models Trained on Code
- KaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers
- CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model
- CoSQA: 20,000+ Web Queries for Code Search and Question Answering
- APPS: Measuring Coding Challenge Competence With APPS
- CodeTrans: Towards Cracking the Language of Silicon's Code Through Self-Supervised Deep Learning and High Performance Computing
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
- squall: On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries
- FB-Java: Deep Graph Matching and Searching for Semantic Code Retrieval
- SO-DS: Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent
- GREAT: Global Relational Models of Source Code
- ATOM: Commit Message Generation Based on Abstract Syntax Tree and Hybrid Ranking
- CLCDSA: Cross Language Code Clone Detection using Syntactical Features and API Documentation
- JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code
- CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
- Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks
- CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases
- CoDiSum: Commit Message Generation for Source Code Changes
- SParC: Cross-Domain Semantic Parsing in Context
- PtrGNCMsg: Generating Commit Messages from Diffs Using Pointer-Generator Network
- Bugs2Fix: An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation
- Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task
- SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities
- CONCODE: Mapping Language to Code in Programmatic Context
- TL-CodeSum: Summarizing Source Code with Transferred API Knowledge
- Draper VDISC: Automated Vulnerability Detection in Source Code Using Deep Representation Learning
- NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System
- DeepCom: Deep Code Comment Generation
- CGD: VulDeePecker: A Deep Learning-Based System for Vulnerability Detection
- QuixBugs: A Multi-lingual Program Repair Benchmark Set Based on the Quixey Challenge
- WikiSQL: Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning
- CommitGen: A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes
- DeepFix: Fixing Common C Language Errors by Deep Learning
- PY150: Probabilistic Model for Code with Decision Trees
- CODE-NN: Summarizing Source Code using a Neural Attention Model
- BigCloneBench: Evaluating Clone Detection Tools with BigCloneBench
- WikiTQ: Compositional Semantic Parsing on Semi-Structured Tables
- POJ-104: Convolutional Neural Networks over Tree Structures for Programming Language Processing
- Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs
- GitHub Java Corpus: Mining Source Code Repositories at Massive Scale using Language Modeling
- ATIS: Expanding the Scope of the ATIS Task: The ATIS-3 Corpus
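Many of the code-generation benchmarks above (HumanEval, MBPP, APPS, MultiPL-E, and others) report results with the pass@k metric. A minimal sketch of the unbiased pass@k estimator described in the HumanEval paper ("Evaluating Large Language Models Trained on Code"), assuming `n` samples per problem of which `c` pass the unit tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n generations is correct,
    given that c of the n generations pass the unit tests."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k
        # draw must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging `pass_at_k` over all problems in a benchmark yields its reported score; for example, `pass_at_k(10, 3, 1)` is about 0.3, the plain fraction of correct single samples.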