Limitations, Ethical Considerations, and More: Everything You Need to Know About WikiWebQuestions

12 Jun 2024


(1) Silei Xu, Computer Science Department, Stanford University, Stanford, CA (equal contribution);

(2) Shicheng Liu, Computer Science Department, Stanford University, Stanford, CA (equal contribution);

(3) Theo Culhane, Computer Science Department, Stanford University, Stanford, CA;

(4) Elizaveta Pertseva, Computer Science Department, Stanford University, Stanford, CA;

(5) Meng-Hsi Wu, Computer Science Department, Stanford University, Stanford, CA;

(6) Sina J. Semnani, Computer Science Department, Stanford University, Stanford, CA;

(7) Monica S. Lam, Computer Science Department, Stanford University, Stanford, CA.

Abstract and Introduction

Related Work

Semantic Parsing for Wikidata

WikiWebQuestions (WWQ) Dataset



Experiment with QALD-7

Conclusions, Limitations, Ethical Considerations, Acknowledgements, and References

A. Examples of Recovering from Entity Linking Errors

8 Conclusion

We have created a new high-quality benchmark, WikiWebQuestions, for large knowledge-base question answering. The dataset is based on the popular WebQuestionsSP dataset with natural questions, annotated with SPARQL for Wikidata.

We establish a first, strong baseline of 65% answer accuracy and 72% F1 score for WikiWebQuestions. This is achieved by fine-tuning LLaMA with a few-shot training data set using a SPARQL query format modified for semantic parsing.
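For readers unfamiliar with the two metrics, the following is a minimal sketch of how answer accuracy and F1 are commonly computed for knowledge-base question answering. It assumes that accuracy means an exact match between the predicted and gold answer sets, and that F1 is computed per question and then averaged. The function names `set_f1` and `evaluate` are hypothetical, and this is not the authors' evaluation script.

```python
from typing import Iterable, List, Tuple


def set_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """F1 between two answer sets (e.g. sets of Wikidata entity IDs)."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # assumption: two empty answer sets count as a match
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def evaluate(examples: List[Tuple[List[str], List[str]]]) -> Tuple[float, float]:
    """Return (exact-match accuracy, macro-averaged F1) over (predicted, gold) pairs."""
    n = len(examples)
    accuracy = sum(set(p) == set(g) for p, g in examples) / n
    mean_f1 = sum(set_f1(p, g) for p, g in examples) / n
    return accuracy, mean_f1


# Illustrative toy data only, not taken from the dataset.
toy = [(["Q1"], ["Q1"]), (["Q1", "Q2"], ["Q1"]), (["Q3"], ["Q4"])]
accuracy, mean_f1 = evaluate(toy)
```

Under these assumptions, F1 gives partial credit when a prediction returns some but not all correct answers, which is why it can be higher than exact-match accuracy.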

We show that we can reduce the hallucination of large language models like GPT-3 by grounding them with a semantic parser. On the dev set of WikiWebQuestions, this combined approach provides useful information for 96% of the questions. More importantly, it generates verifiable answers for 76% of the questions.
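The combination described above can be sketched roughly as follows. This is an illustrative outline, not the authors' implementation: it assumes a semantic parser that returns a SPARQL string (or nothing), a function that executes the query against Wikidata, and an LLM used only as a fallback whose output is labeled as unverified. The names `Answer`, `answer_question`, `parse_to_sparql`, `run_sparql`, and `ask_llm` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Answer:
    values: List[str]
    verified: bool                 # True only when backed by a knowledge-base query
    sparql: Optional[str] = None   # the query a user could inspect to check the answer


def answer_question(
    question: str,
    parse_to_sparql: Callable[[str], Optional[str]],  # hypothetical semantic parser
    run_sparql: Callable[[str], List[str]],           # hypothetical Wikidata query runner
    ask_llm: Callable[[str], str],                    # hypothetical LLM fallback
) -> Answer:
    """Prefer a verifiable knowledge-base answer; otherwise return a labeled LLM guess."""
    sparql = parse_to_sparql(question)
    if sparql is not None:
        try:
            results = run_sparql(sparql)
        except Exception:
            results = []  # treat a failed query like an empty result
        if results:
            return Answer(values=results, verified=True, sparql=sparql)
    # No usable query result: fall back to the LLM and flag the answer as a guess.
    return Answer(values=[ask_llm(question)], verified=False)


# Toy stubs for illustration only.
verified = answer_question(
    "toy question", lambda q: "SELECT ...", lambda s: ["Q42"], lambda q: "guess"
)
fallback = answer_question(
    "toy question", lambda q: None, lambda s: [], lambda q: "guess"
)
```

Keeping the `verified` flag and the query alongside the answer is what would let an application distinguish grounded answers from LLM guesses when presenting results to users.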


Limitations

While applications of large language models seem to expand every day, this paper focuses mainly on factoid question answering. Long-form text generation, for example, is outside the scope of our experiments, but the methodology described here may be extended to that setting in the future. Even though knowledge bases are an important source of facts, a large portion of the knowledge available in digital form (e.g., Wikipedia and news articles) is not organized into knowledge bases. As such, the results of this paper can be considered complementary to the larger body of fact-checking research based on free text.

Our semantic parser can be used to verify answers from LLMs. However, this additional round of running the semantic parser and querying Wikidata increases the response latency, which may be noticeable to end users of such systems.

All of our datasets and experiments are in English. Expanding to other languages, while possible (Moradshahi et al., 2020), is outside the scope of this work.

Our experiments were performed using GPT-3 (davinci-002), as that was the model we had access to when we started the project. Later LLMs will undoubtedly produce better results. Nonetheless, the need for verifiable results based on live database accesses will remain.

Ethical Considerations

LLMs are used by millions of people every day. We hope that this line of work will help make them more reliable for everyone, mitigating some of their potential downsides and giving users access to more accurate information. Our use of Wikidata will enable future researchers and developers to connect their systems with a large, diverse, and live knowledge graph that is updated every day. We do not anticipate any harm resulting from the methods introduced in this work.

We did not crowdsource any datasets for this paper: the questions were converted from a previous dataset, and all re-annotation and analysis was done by the authors.

To conduct the experiments in this paper, we used an estimated total of 60 NC96ads-A100 GPU hours on Microsoft Azure. Each fine-tuning experiment takes roughly 3 hours, and we conducted roughly 20 experiments to arrive at the results in this paper.


Acknowledgements

This work is supported in part by the National Science Foundation, the Alfred P. Sloan Foundation, the Verdant Foundation, Microsoft Azure AI credit, KDDI, JPMorgan Chase, and the Stanford Human-Centered Artificial Intelligence (HAI) Institute. We also thank the reviewers for their valuable comments and suggestions.


This paper is available on arXiv under a CC 4.0 license.