jamesmurdza/humaneval-results

Benchmarking Code Generation with LLMs

📖 Related article: Challenges in code generation with GPT-4

Below is a comparison of LLM code generation success rates on the HumanEval problem dataset. Each task was evaluated ten times per model: each run first made a call to the OpenAI API, then ran the generated code against unit tests. Traces from each run are linked in the table below.

These results were generated with the HumanEval-LangChain workflow.
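
The evaluation loop described above (generate code, run it against unit tests, repeat ten times) can be sketched roughly as follows. This is a minimal illustration and not the HumanEval-LangChain implementation: `passes_tests`, `success_rate`, and the toy `add` task are hypothetical names, the model call is replaced by a stub that returns canned completions, and the tests are assumed to follow the HumanEval convention of a `check(candidate)` function.

```python
from typing import Callable


def passes_tests(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the generated code passes a HumanEval-style check()."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # defines the generated function
        exec(test_src, namespace)        # defines check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Syntax errors, runtime errors and failed assertions all count as a failed run.
        return False


def success_rate(generate: Callable[[], str], test_src: str,
                 entry_point: str, runs: int = 10) -> float:
    """Call the (hypothetical) generator `runs` times and return the pass fraction."""
    passed = sum(passes_tests(generate(), test_src, entry_point) for _ in range(runs))
    return passed / runs


# Toy task standing in for a HumanEval problem (assumption, not from the dataset).
TEST_SRC = "def check(candidate):\n    assert candidate(1, 2) == 3\n"
GOOD = "def add(x, y):\n    return x + y\n"
BAD = "def add(x, y):\n    return x - y\n"

# Stub "model": 7 correct completions followed by 3 incorrect ones.
completions = iter([GOOD] * 7 + [BAD] * 3)
rate = success_rate(lambda: next(completions), TEST_SRC, "add")
print(f"{rate:.0%}")  # 70%
```

Running model output through `exec` in the current process is only acceptable for a toy example; a real harness would likely isolate each run in a sandboxed subprocess with a timeout.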

Results

Model Runs per task Tasks Overall success
CodeLlama-34b-Instruct-hf 10 163 45.1%
gpt-3.5-turbo 10 163 77.0%
gpt-4 10 163 84.8%

Results by task

Task Code Llama GPT-3.5 GPT-4 Description
HumanEval/0 100% 100% 100% Check if there are any two numbers in the given list that are closer to each other than the given threshold.
HumanEval/1 0% 40% 60% Separate multiple groups of nested parentheses into separate strings and return a list of those strings.
HumanEval/2 100% 100% 100% Return the decimal part of a positive floating point number.
HumanEval/3 100% 100% 100% Detect if at any point the balance of the bank account falls below zero and return True, otherwise return False.
HumanEval/4 100% 100% 100% Calculate the mean absolute deviation of a given list of numbers.
HumanEval/5 100% 100% 100% Insert a number 'delimeter' between every two consecutive elements of the input list `numbers`.
HumanEval/6 0% 100% 100% Parse a string representing multiple groups of nested parentheses and return a list of the deepest level of nesting for each group.
HumanEval/7 100% 100% 100% Filter an input list of strings to only include strings that contain a given substring.
HumanEval/8 0% 100% 100% Calculate the sum and product of all the integers in a given list.
HumanEval/9 100% 100% 0% Generate a list of rolling maximum elements found until a given moment in the sequence.
HumanEval/10 0% 100% 100% The function `make_palindrome` should find the shortest palindrome that begins with the supplied string.
HumanEval/11 90% 100% 100% Perform binary XOR on two input strings and return the result as a string.
HumanEval/12 100% 100% 100% Return the longest string from a list of strings, or None if the list is empty.
HumanEval/13 100% 100% 100% Calculate and return the greatest common divisor of two integers.
HumanEval/14 100% 100% 100% Return a list of all prefixes of the input string, from shortest to longest.
HumanEval/15 100% 100% 100% Return a string containing space-delimited numbers starting from 0 up to n inclusive.
HumanEval/16 100% 100% 100% Count the number of distinct characters in a given string, regardless of case.
HumanEval/17 100% 100% 100% Parse a string representing musical notes and return a list of integers representing the duration of each note in beats.
HumanEval/18 100% 100% 100% Count how many times a given substring can be found in the original string, including overlapping cases.
HumanEval/19 90% 100% 100% Sort the numbers in the input string from smallest to largest.
HumanEval/20 0% 100% 100% Find and return the two numbers in a list that are closest to each other.
HumanEval/21 100% 100% 100% Rescale a list of numbers to the unit interval [0, 1] using a linear transformation.
HumanEval/22 100% 100% 100% Filter a given list of any python values and return only the integers.
HumanEval/23 100% 100% 100% Return the length of a given string.
HumanEval/24 0% 100% 100% Find the largest number that evenly divides a given number n, smaller than n.
HumanEval/25 100% 100% 100% Factorize should return a list of prime factors of a given integer in ascending order.
HumanEval/26 0% 100% 100% Remove all elements from the list that occur more than once while keeping the order of the remaining elements the same.
HumanEval/27 100% 100% 100% The function should flip the case of each character in the given string.
HumanEval/28 100% 100% 100% Concatenate a list of strings into a single string.
HumanEval/29 100% 100% 100% Filter an input list of strings to only include strings that start with a given prefix.
HumanEval/30 100% 100% 100% Return a new list containing only the positive numbers from the input list.
HumanEval/31 100% 100% 100% Determine whether a given number is prime or not.
HumanEval/32 0% 0% 30% The function `find_zero` should find a zero point of a polynomial with given coefficients.
HumanEval/33 0% 100% 100% Sort the values at indices divisible by three in the given list, while keeping the values at other indices unchanged.
HumanEval/34 100% 100% 100% Return a sorted list of unique elements from the input list.
HumanEval/35 100% 100% 100% Return the maximum element in the given list.
HumanEval/36 0% 90% 100% Count the number of times the digit 7 appears in integers less than n that are divisible by 11 or 13.
HumanEval/37 0% 100% 100% Sort the even indices of a list in ascending order, while keeping the odd indices unchanged.
HumanEval/38 10% 100% 100% The function `encode_cyclic` should return an encoded string by cycling groups of three characters, while the function `decode_cyclic` should return the decoded string from an encoded string.
HumanEval/39 0% 20% 100% Return the n-th number that is both a Fibonacci number and a prime number.
HumanEval/40 100% 100% 100% Check if there are three distinct elements in a list that sum to zero.
HumanEval/41 0% 0% 60% Count the number of collisions between cars moving in opposite directions on an infinitely long road.
HumanEval/42 100% 100% 100% The function should return a list with all elements incremented by 1.
HumanEval/43 100% 100% 100% Check if there are two distinct elements in a list that sum to zero.
HumanEval/44 0% 100% 100% Convert a given number from its current base to a specified base and return the string representation of the converted number.
HumanEval/45 100% 100% 100% Calculate and return the area of a triangle given the length of one side and the height.
HumanEval/46 40% 100% 100% Compute the n-th element of the Fib4 number sequence efficiently without using recursion.
HumanEval/47 100% 100% 100% Calculate and return the median of the elements in the given list.
HumanEval/48 100% 100% 100% Check if the given string is a palindrome.
HumanEval/49 100% 100% 100% Calculate the value of 2 raised to the power of n modulo p.
HumanEval/50 100% 100% 100% The function above should decode a string that has been encoded using the encode_shift function.
HumanEval/51 20% 100% 100% The function should remove all vowels from a given string.
HumanEval/52 100% 100% 100% Return True if all numbers in the list l are below the threshold t.
HumanEval/53 100% 100% 100% The function should add two numbers together.
HumanEval/54 20% 0% 100% Check if two words have the same characters.
HumanEval/55 100% 100% 0% Return the n-th Fibonacci number.
HumanEval/56 100% 100% 100% Check if every opening bracket in the given string has a corresponding closing bracket.
HumanEval/57 100% 100% 100% Return True if the elements in the given list are monotonically increasing or decreasing.
HumanEval/58 100% 100% 100% Return a sorted list of unique common elements between two input lists.
HumanEval/59 0% 100% 100% Return the largest prime factor of a given number.
HumanEval/60 100% 100% 100% The function `sum_to_n` should calculate the sum of numbers from 1 to n.
HumanEval/61 100% 100% 100% Check if every opening bracket in the given string has a corresponding closing bracket.
HumanEval/62 0% 100% 100% Calculate the derivative of a polynomial represented by a list of coefficients.
HumanEval/63 0% 100% 100% Compute the n-th element of the FibFib number sequence efficiently.
HumanEval/64 0% 100% 90% Count the number of vowels in a given word, considering 'y' as a vowel only when it is at the end of the word.
HumanEval/65 0% 60% 100% Shift the digits of the integer x to the right by shift positions and return the result as a string, or reverse the digits if shift is greater than the number of digits.
HumanEval/66 100% 100% 100% The function should calculate the sum of the ASCII codes of the uppercase characters in a given string.
HumanEval/67 0% 80% 100% Return the number of mango fruits in the basket given the total number of fruits and the number of apples and oranges.
HumanEval/68 90% 100% 100% The function `pluck` should find the smallest even value in an array representing a branch of a tree and return it along with its index.
HumanEval/69 0% 60% 100% Return the greatest integer that has a frequency greater than or equal to the value of the integer itself, or -1 if no such integer exists.
HumanEval/70 40% 70% 100% Sort a given list of integers in a specific order.
HumanEval/71 0% 100% 100% Calculate the area of a triangle if the given side lengths form a valid triangle, otherwise return -1.
HumanEval/72 100% 100% 100% Return True if the object q is balanced and the sum of its elements is less than or equal to the maximum weight w, otherwise return False.
HumanEval/73 0% 100% 100% Find the minimum number of elements that need to be changed to make the array palindromic.
HumanEval/74 10% 20% 80% Return the list with the total number of characters in all strings less than the other list, or the first list if they have the same number of characters.
HumanEval/75 0% 0% 100% Check if the given number is the product of three prime numbers.
HumanEval/76 0% 70% 80% The function should determine whether a given number is a simple power of another given number.
HumanEval/77 0% 50% 50% Check if the given integer is a cube of some integer number.
HumanEval/78 10% 100% 100% Count the number of prime hexadecimal digits in a given hexadecimal number.
HumanEval/79 0% 100% 100% Convert a decimal number to binary format and return it as a string with 'db' at the beginning and end.
HumanEval/80 100% 100% 100% Check if a given string is "happy" by determining if its length is at least 3 and every 3 consecutive letters are distinct.
HumanEval/81 0% 80% 100% The function should convert a list of GPAs into a list of corresponding letter grades.
HumanEval/82 100% 100% 100% Return True if the length of the input string is a prime number, and False otherwise.
HumanEval/83 100% 40% 0% Count the number of n-digit positive integers that start or end with 1.
HumanEval/84 20% 70% 0% Return the total sum of the digits of a positive integer in binary form.
HumanEval/85 60% 100% 100% Add the even elements that are at odd indices in a given non-empty list of integers.
HumanEval/86 100% 100% 100% The function should return a string where all the words are replaced by a new word with their characters arranged in ascending order based on ascii value, while keeping the order of words and blank spaces in the sentence.
HumanEval/87 60% 100% 0% The function should return a list of tuples representing the coordinates of the integer x in the given 2-dimensional list, sorted by rows in ascending order and by columns in descending order.
HumanEval/88 10% 90% 100% Sort the given array in ascending order if the sum of the first and last index values is odd, or sort it in descending order if the sum is even, and return a copy of the sorted array.
HumanEval/89 0% 100% 100% Encrypt a given string by shifting each letter down the alphabet by two multiplied to two places.
HumanEval/90 0% 100% 100% The function should return the second smallest element in a given list of integers, or None if there is no such element.
HumanEval/91 0% 0% 0% Count the number of sentences that start with the word "I" in a given string.
HumanEval/92 70% 100% 100% Return true if one of the numbers is equal to the sum of the other two, and all numbers are integers; otherwise, return false.
HumanEval/93 0% 10% 0% Encode a given message by swapping the case of all letters and replacing vowels with the letter that appears 2 places ahead in the English alphabet.
HumanEval/94 20% 100% 100% The function should find the largest prime value in a given list of integers and return the sum of its digits.
HumanEval/95 0% 0% 100% Check if all keys in a dictionary are either all lowercase or all uppercase strings, and return True if they are, otherwise return False.
HumanEval/96 0% 100% 100% Return an array of the first n prime numbers that are less than n.
HumanEval/97 100% 100% 100% Return the product of the unit digits of two given integers.
HumanEval/98 0% 70% 100% Count the number of uppercase vowels in even indices of a given string.
HumanEval/99 0% 100% 100% Return the closest integer to a given number, rounding away from zero if the number is equidistant from two integers.
HumanEval/100 0% 0% 100% Create a pile of stones with a specific number of levels, where each level has a specific number of stones.
HumanEval/101 0% 100% 100% Split a string of words separated by commas or spaces into an array of individual words.
HumanEval/102 0% 100% 100% Return the largest even integer within the range [x, y] inclusive, or -1 if there is no such number.
HumanEval/103 0% 20% 100% Calculate the average of a range of integers, round it to the nearest integer, and convert it to binary.
HumanEval/104 70% 100% 100% Return a sorted list of positive integers from the input list that do not contain any even digits.
HumanEval/105 50% 100% 90% Sort an array of integers between 1 and 9 inclusive, reverse the array, and replace each digit with its corresponding name.
HumanEval/106 80% 100% 30% Calculate and return a list of size n, where each element at index i is the factorial of i if i is even, or the sum of numbers from 1 to i otherwise.
HumanEval/107 100% 100% 100% The function should return a tuple containing the number of even and odd integer palindromes within the range (1, n).
HumanEval/108 0% 0% 10% Count the number of elements in the array that have a sum of digits greater than 0.
HumanEval/109 0% 100% 100% Determine if it is possible to obtain a sorted array by performing right shift operations on the given array.
HumanEval/110 0% 10% 100% Determine whether it is possible to exchange elements between two lists to make the first list contain only even numbers.
HumanEval/111 0% 90% 100% Return a dictionary containing the letter(s) with the highest occurrence and their corresponding count from a given string.
HumanEval/112 100% 100% 100% The function should delete characters from string `s` that are equal to any character in string `c`, and then check if the resulting string is a palindrome.
HumanEval/113 0% 0% 100% Count the number of odd digits in each string of a given list and return a list of strings describing the count for each string.
HumanEval/114 40% 100% 100% Calculate the minimum sum of any non-empty sub-array in the given array of integers.
HumanEval/115 0% 20% 0% The function `max_fill` should calculate the number of times the buckets need to be lowered in order to empty the wells in the given grid.
HumanEval/116 100% 100% 100% Sort an array of non-negative integers based on the number of ones in their binary representation in ascending order, and for numbers with the same number of ones, sort based on decimal value.
HumanEval/117 0% 100% 100% Return a list of words from the input string that contain exactly n consonants.
HumanEval/118 0% 20% 10% Find the closest vowel that stands between two consonants from the right side of the word.
HumanEval/119 0% 0% 40% Check if it is possible to concatenate two strings of parentheses in some order to create a balanced string.
HumanEval/120 0% 0% 0% Return a sorted list of the maximum k numbers in the given array.
HumanEval/121 80% 0% 100% Return the sum of all odd elements in even positions in a given list of integers.
HumanEval/122 100% 10% 100% The function should return the sum of the elements with at most two digits from the first k elements of the given array.
HumanEval/123 10% 100% 100% Return a sorted list of the odd numbers in the Collatz sequence for a given positive integer.
HumanEval/124 90% 100% 100% Validate a given date string and return True if the date is valid according to the specified rules, otherwise return False.
HumanEval/125 0% 0% 100% The function should split a string of words on whitespace or commas, and if neither exists, it should return the number of lowercase letters with odd order in the alphabet.
HumanEval/126 0% 0% 50% Check if a given list of numbers is sorted in ascending order and does not contain more than one duplicate of the same number.
HumanEval/127 90% 10% 80% Determine whether the length of the intersection of two intervals is a prime number.
HumanEval/128 100% 80% 60% The function `prod_signs` should calculate the sum of the magnitudes of integers in the input array, multiplied by the product of the signs of each number.
HumanEval/129 0% 0% 20% Find the minimum path of length k in a grid, where each cell contains a unique value, and return the ordered list of values on the cells that the minimum path goes through.
HumanEval/130 0% 0% 0% Return a list of the first n + 1 numbers of the Tribonacci sequence.
HumanEval/131 0% 10% 100% Return the product of the odd digits in a given positive integer, or 0 if all digits are even.
HumanEval/132 0% 0% 0% Check if a given string contains a valid subsequence of square brackets where at least one bracket is nested.
HumanEval/133 0% 0% 100% The function should calculate the sum of the squared numbers in a given list, after rounding each element to the nearest upper integer.
HumanEval/134 0% 90% 0% Check if the last character of a given string is an alphabetical character and is not part of a word.
HumanEval/135 0% 80% 100% Return the largest index of an element that is not greater than or equal to the element immediately preceding it, or -1 if no such element exists.
HumanEval/136 100% 100% 100% Return a tuple containing the largest negative integer and the smallest positive integer from a given list, or None if there are no negative or positive integers.
HumanEval/137 0% 0% 90% Return the larger variable in its given variable type, or None if the values are equal.
HumanEval/138 100% 100% 90% Evaluate whether the given number n can be written as the sum of exactly 4 positive even numbers.
HumanEval/139 0% 80% 100% Calculate the special factorial of a given integer.
HumanEval/140 0% 70% 50% Replace spaces in a given string with underscores, and if there are more than 2 consecutive spaces, replace them with a hyphen.
HumanEval/141 0% 100% 90% Validate whether a given file name meets certain criteria.
HumanEval/142 100% 0% 100% Calculate the sum of squared and cubed entries in a list based on their index.
HumanEval/143 100% 100% 100% Return a string containing the words from the original sentence whose lengths are prime numbers, in the same order as the original sentence.
HumanEval/144 90% 100% 100% The function should determine whether the product of two fractions is a whole number or not.
HumanEval/145 0% 0% 0% Sort the given list of integers in ascending order based on the sum of their digits, with ties broken by the index in the original list.
HumanEval/146 0% 100% 100% Count the number of elements in the input array that are greater than 10 and have both the first and last digits as odd numbers.
HumanEval/147 0% 90% 0% Calculate the number of triples in an array where the sum of the elements is a multiple of 3.
HumanEval/148 0% 100% 80% Return a tuple of planets whose orbits are located between the orbit of planet1 and the orbit of planet2, sorted by proximity to the sun.
HumanEval/149 0% 100% 100% Delete strings with odd lengths from a list of strings and return the remaining strings in ascending order of length, with alphabetical sorting for strings of the same length.
HumanEval/150 100% 100% 100% Return the value of x if n is a prime number and return the value of y otherwise.
HumanEval/151 50% 100% 90% Calculate the sum of squares of the odd numbers in the given list, ignoring negative numbers and non-integers.
HumanEval/152 100% 100% 100% The function should compare the guesses of a person with the actual scores of a number of matches and return an array indicating how far off each guess was.
HumanEval/153 10% 70% 50% Return the strongest extension for a given class name and list of extensions.
HumanEval/154 0% 100% 100% Check if the second word or any of its rotations is a substring in the first word.
HumanEval/155 0% 0% 0% Return a tuple containing the count of even and odd digits in the given integer.
HumanEval/156 40% 100% 100% Convert a positive integer to its lowercase Roman numeral equivalent.
HumanEval/157 0% 100% 100% Check if the given lengths of the three sides of a triangle form a right-angled triangle.
HumanEval/158 40% 100% 100% Return the word with the maximum number of unique characters from a list of strings.
HumanEval/159 0% 50% 100% The function should calculate the total number of carrots eaten and the number of carrots remaining after a rabbit has eaten a certain number of carrots and still needs to eat more.
HumanEval/160 0% 100% 50% Evaluate an algebraic expression using the given operators and operands.
HumanEval/161 0% 100% 100% Reverse the case of letters in a string, and if the string contains no letters, reverse the string.
HumanEval/162 100% 100% 100% Convert a given string into its MD5 hash equivalent.
HumanEval/163 0% 0% 80% Generate a list of even digits between two given positive integers, in ascending order.
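
For readers who want to analyze the table above, one possible way to parse its rows and aggregate them is sketched below. It assumes each row has the shape `HumanEval/N a% b% c% description`, as in the text of this page; `parse_row` and `mean_success` are hypothetical helper names, and the three sample rows are copied from the table. Whether the "Overall success" figures were computed as a plain mean of per-task rates is an assumption, not something this page states.

```python
import re

# Task id, three percentage columns (Code Llama, GPT-3.5, GPT-4), then the description.
ROW = re.compile(r"^(HumanEval/\d+)\s+(\d+)%\s+(\d+)%\s+(\d+)%\s+(.*)$")


def parse_row(line: str):
    """Parse one table row into a dict, or return None if the line is not a row."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    task, llama, gpt35, gpt4, description = match.groups()
    return {
        "task": task,
        "Code Llama": int(llama),
        "GPT-3.5": int(gpt35),
        "GPT-4": int(gpt4),
        "description": description,
    }


def mean_success(rows, model: str) -> float:
    """Average per-task success rate (in percent) for one model column."""
    return sum(row[model] for row in rows) / len(rows)


sample = """\
HumanEval/0 100% 100% 100% Check if there are any two numbers in the given list that are closer to each other than the given threshold.
HumanEval/1 0% 40% 60% Separate multiple groups of nested parentheses into separate strings and return a list of those strings.
HumanEval/9 100% 100% 0% Generate a list of rolling maximum elements found until a given moment in the sequence.
"""

rows = [r for r in map(parse_row, sample.splitlines()) if r is not None]

# Tasks where GPT-4 scored lower than GPT-3.5 in this sample.
regressions = [r["task"] for r in rows if r["GPT-4"] < r["GPT-3.5"]]

print(mean_success(rows, "GPT-3.5"))  # 80.0
print(regressions)                    # ['HumanEval/9']
```

The same filter applied to the full table would surface every task where the stronger model underperformed, which is often a more useful starting point for inspecting traces than the overall averages.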

About

Evaluation results of code generation LLMs
