Benchmark LLM

  • [[A Benchmark for LLMs on Planning and Reasoning about Change|Large Language Models Still Can’t Plan (A Benchmark for LLMs on Planning and Reasoning about Change)]]
  • The recent advances in large language models (LLMs) have transformed the field of natural language processing (NLP)
  • From GPT-3 to PaLM, each new large language model pushes the state-of-the-art performance on natural language tasks further
  • current benchmarks are relatively simplistic, and performance on them cannot be used as evidence to support claims about LLMs' reasoning capabilities
  • an extensible assessment framework, motivated by the above gaps in current benchmarks, that tests the abilities of LLMs on a central aspect of human intelligence: reasoning about actions and change
  • multiple test cases, each evaluating a certain aspect of reasoning about actions and change
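
To make the "extensible framework with multiple test cases" idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the class names (`ReasoningCase`, `Benchmark`), the test-case names, the example prompts, the exact-match scoring rule, and the `stub_model` placeholder are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a "model" is any callable that maps a prompt string to a response string.
Model = Callable[[str], str]


@dataclass
class Instance:
    prompt: str    # question posed to the model
    expected: str  # reference answer


@dataclass
class ReasoningCase:
    """One test case that probes a single aspect of reasoning (hypothetical design)."""
    name: str
    instances: List[Instance]

    def score(self, model: Model) -> float:
        # Fraction of instances answered with a case-insensitive exact match.
        if not self.instances:
            return 0.0
        correct = sum(
            model(inst.prompt).strip().lower() == inst.expected.strip().lower()
            for inst in self.instances
        )
        return correct / len(self.instances)


class Benchmark:
    """Registry of test cases; new cases can be added without touching existing ones."""

    def __init__(self) -> None:
        self._cases: Dict[str, ReasoningCase] = {}

    def register(self, case: ReasoningCase) -> None:
        self._cases[case.name] = case

    def run(self, model: Model) -> Dict[str, float]:
        # One score per registered test case.
        return {name: case.score(model) for name, case in self._cases.items()}


def stub_model(prompt: str) -> str:
    # Placeholder standing in for a real LLM call; it always answers "yes".
    return "yes"


bench = Benchmark()
bench.register(ReasoningCase(
    "plan_validity",  # hypothetical test-case name
    [Instance("Is the plan [pick up A, stack A on B] valid?", "yes")],
))
bench.register(ReasoningCase(
    "goal_reachability",  # hypothetical test-case name
    [
        Instance("Can block A end up on block B?", "yes"),
        Instance("Can block A be on B and B on A at once?", "no"),
    ],
))
results = bench.run(stub_model)
```

Keeping each test case as its own object with its own score means results are reported per reasoning aspect rather than as a single aggregate number, and adding a new aspect only requires one more `register` call.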