Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter update
Updated Sep 10, 2024 - Python
A benchmark for prompt injection detection systems.
A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.
Hallucinations (Confabulations) Document-Based Benchmark for RAG
A collection of LLM-related papers, theses, tools, datasets, courses, open-source models, and benchmarks
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
LLM-KG-Bench is a framework and task collection for automated benchmarking of Large Language Models (LLMs) on Knowledge Graph (KG) related tasks.
FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts
An app and set of methodologies designed to evaluate the performance of various Large Language Models (LLMs) on the text-to-SQL task. Our goal is to offer a standardized way to measure how well these models can generate SQL queries from natural language descriptions.
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects - IEEE LAD'24
Evaluating the Effectiveness of Code-generation Models on Hinglish Prompts
We introduce a benchmark for testing how well LLMs can find vulnerabilities in cryptographic protocols. By combining LLMs with symbolic reasoning tools like Tamarin, we aim to improve the efficiency and thoroughness of protocol analysis, paving the way for future AI-powered cybersecurity defenses.
PARROT (Performance Assessment of Reasoning and Responses On Trivia) is a novel benchmarking framework designed to evaluate Large Language Models (LLMs) on real-world, complex, and ambiguous QA tasks.
This is a series of Python scripts for zero-shot and chain-of-thought LLM scripting
Human readability judgments as a benchmark for LLMs
Python code for the paper "LLMs are zero-shot next-location predictors" by Beneduce et al.
Experiments in evaluating various prompting strategies and LLM performance generally
Benchmark LLMs' abilities to plan, strategize, and reason by making them play chess against each other.
Framework to benchmark LLM performance on domain categorization, done as part of my internship at iQ Global.