Abstract

In general knowledge base question answering (KBQA) models, subject recognition (SR) is usually a precondition of finding an answer, and it is common to employ a general named entity recognition (NER) model such as BERT-CRF to recognize the subject. However, previous research usually ignores the difference between a NER task and a SR task, and a wrong entity recognized by the NER model will certainly lead to a wrong answer in the KBQA task, which is one bottleneck for KBQA performance. In this paper, a multigranularity pruning model (MGPM) is proposed to answer a question when general models fail to recognize a subject. In MGPM, the set of all possible subjects in the knowledge base (KB) is pruned successively by 4 multigranularity pruning submodels based on relation constraints (domain and tuple), string similarity, and semantic similarity. Experimental results show that our model is compatible with various KBQA models for both single-relation and complex question answering. The integrated MGPM model (with the BERT-CRF model) achieves a SR accuracy of 94.4% on the SimpleQuestions dataset, 68.6% on the WebQuestionsSP dataset, and 63.7% on the WebQuestions dataset, outperforming the original model by margins of 3.6%, 8.6%, and 5.3%, respectively.

1. Introduction

Knowledge base question answering (KBQA) is an important natural language processing (NLP) task which aims to answer natural language questions automatically with facts in a knowledge base (KB). In general, there are 4 subtasks under KBQA: subject recognition (SR), entity linking, relation prediction, and answer retrieval. In SR, the subject entity in an input question is recognized, which is an entity (or a set of entities with the same name) in a KB. In entity linking, a unique entity is selected from this entity set. In relation prediction, a relation in the KB is selected as the best one to describe the question. In answer retrieval, one or more entities are retrieved from the KB based on the subject and relation, and a unique entity is selected from them as the answer to the question.

An example is shown in Figure 1. A question "What poetry did Shakespeare write in 1604?" is fed to a KBQA system. First, in the SR task, the entity "Shakespeare" in the KB is recognized as the subject entity of the question. As there are several entities (e.g., a person named Shakespeare, a book titled Shakespeare) which share the name "Shakespeare" in the KB, in the entity linking task, the unique entity "Shakespeare" with the attribute "person" is selected from all entities named "Shakespeare." Then, all relation candidates (e.g., write, birthplace, country, etc.) with the subject "Shakespeare" can be retrieved from the KB, and the best-matched one, "write," is selected in the relation prediction task. Finally, entities with the subject "Shakespeare (person)" and relation "write" can be retrieved from the KB, which represent all works of Shakespeare in this example, and the best-matched one, "The Sonnets," is selected in the answer retrieval task. In general, the input of each task is the input question and a set of candidates retrieved from the KB based on the output of the previous task (except the first task, SR), and the output of each task is an element of the candidate set.

In practical KBQA systems, there could be some differences in these tasks. For example, for a complex question such as "What poetry did the author of Venus and Adonis write in 1604?," several relations are required in the relation prediction task. For most questions in the SimpleQuestions dataset, a unique answer can be retrieved based on a subject name and a relation, so the entity linking task and the answer retrieval task can be integrated with other tasks.

In general KBQA models, a named entity recognition (NER) model (e.g., BERT-CRF) is usually employed to recognize the entity, which contains one or several successive words in a question, as the subject entity in the SR task. In these models, a correct subject is the precondition of a correct relation and answer. Unfortunately, no model can achieve an accuracy of 100%, so there are always some questions for which no entity or a wrong entity is recognized and matched. In general, these models fail mainly for 3 reasons:
(i) Question with abnormal subject (QWAS): the golden subject in the KB cannot be strictly matched to any n-gram generated from the question [1]. For example, for the question "what is the name of a location in brasília standard time," the golden subject "brasília time zone" in the KB cannot be strictly matched to "brasília standard time" in the question.
(ii) No subject matched (NSM): the recognized entity (for a normal question) by the NER model cannot be strictly matched to any subject in the KB, because it contains no, wrong, not enough, or redundant words. For example, for the question "what type of music does David ruffin play" with the golden subject "David ruffin," the recognized entity "play," "ruffin," or "David ruffin play" would lead to no subject matched.
(iii) Wrong subject matched (WSM): the recognized entity (for a normal question) by the NER model is strictly matched to a wrong subject in the KB. It may be a correct entity in a general NER task, but it will lead to a wrong subject in a KBQA task. In the aforementioned example, "music" is a subject in the KB, but it is not the golden subject of this question.

In the aforementioned examples, a correct entity in a NER task is not necessarily a correct subject in a SR task. In general, there are 3 main differences between a NER task and a SR task under KBQA:
(i) Aim: the aim of a NER task is to recognize several successive words in an input sentence as the entity, whereas the aim of a SR task is to select one entity in the KB as the subject of an input question.
(ii) Number: there could be no, one, or several entities for an input sentence in a NER task, whereas there should be one and only one subject for an input question in a SR task.
(iii) Constraint: a NER task is usually an independent task and there are no extra constraints on the recognized entity, whereas a SR task is a subtask of KBQA and there are mutual constraints between the subject and the relation of an input question.

As a result, besides employing an error correction model to reduce QWAS and an improved NER model to reduce NSM and WSM, a compatible solution is strongly required that focuses on the difference between a NER task and a SR task and answers questions when general models fail. In this paper, we propose the multigranularity pruning model (MGPM) to recognize subjects and answer questions in these cases. Compared with general NER models, the search space of our model is not the successive words in questions but the subjects in the KB, so our model can find correct subjects even for some QWAS. The original massive search space in the KB is gradually narrowed down by 4 multigranularity submodels based on relation constraints, string similarity, and semantic similarity. In this way, the subject with the highest score (calculated by our model) is considered as the recognized subject, and then a general KBQA model can be employed to answer the question based on the recognized subject.

Our main contributions are as follows:
(i) We focus on the difference between a NER task and a SR task and propose a method which is still effective in the case that general models fail to recognize subjects.
(ii) Our method is dataset-agnostic and effective on datasets of both simple questions and complex questions.
(iii) Our method is model-agnostic and compatible with various KBQA models. Experimental results show that the integrated MGPM model with the BERT-CRF model (or Efficient GlobalPointer, EGP) outperforms the original model by a margin of 3.6% (or 4.7%) on the SimpleQuestions dataset, 8.6% (or 8.1%) on the WebQuestionsSP dataset, and 5.3% (or 4.8%) on the WebQuestions dataset.

2. Related Work

The research on KBQA has evolved from earlier domain-specific question answering [2] to open-domain QA based on large-scale KBs such as Freebase [3]. KBQA models have also evolved from semantic parsing-based models [4], which parse questions into structured queries, to neural network-based models [5, 6], which learn semantic representations of both the question and the knowledge from observed data. Some researchers [7–9] also attempt to combine multiple models to utilize information in both natural language questions and KBs.

After pretrained models such as BERT [10], ALBERT [11], XLNet [12], and ELECTRA [13] were proposed, they have been widely employed in various NLP tasks [14–17]. Many studies have employed NER and RE models based on pretrained models and achieved good results. For example, Gangwar et al. [18] employed pretrained models for span extraction, classification, and relation extraction tasks focused on finding quantities, attributes of these quantities, and additional information. Luo et al. [19] proposed a BERT-based approach for single-relation question answering (SR-QA), which consists of two models: entity linking and relation detection. Zhu [20] designed a comprehensive search space for BERT-based relation classification models and employed a neural architecture search method to automatically discover the design choices. However, the best-performing model differs across situations. For example, ELECTRA achieves better performance on some tasks in GLUE [21], ALBERT requires less training cost, and RoFormer is more effective in Chinese NLP tasks. As a result, it would be desirable for a proposed method to be a model-agnostic solution which does not rely on a specific KBQA model or a specific dataset. In this paper, our proposed model works well as a plug-in approach to different KBQA models, improving their results on different datasets.

For the SR task under KBQA, in traditional methods, as the performance of general NER models is not satisfactory, researchers usually employ constraints (e.g., relation constraints, similarity constraints, type constraints, etc.), which do not exist in general NER tasks, to achieve better performance. For example, in the CFO model [1], the subject candidates in a KB are pruned based on the constraint of relation candidates generated by the model. After pretrained models were proposed, as they show satisfactory performance in general NER tasks, which is even better than that of traditional SR models in KBQA tasks, it has become a common way to regard the SR task under KBQA as a general NER task and simply employ a NER model [22–26]. However, no NER model can achieve an accuracy of 100%, so if constraints could also be employed in a SR task, it would probably achieve better performance. Unfortunately, such constraints are not well compatible with pretrained models, and it is difficult to employ them directly in a pretrained NER model.

Besides BERT-CRF, which is a common model for both KBQA and general NER tasks, researchers have proposed many models for various NER tasks in recent years. For example, some researchers [27] proposed a unified generative framework for various NER subtasks to recognize flat, nested, and discontinuous entities, some [28] focus on utilizing both segment-level information and word-level dependencies in NER tasks, and some [29] employ a maximal clique discovery method in a discontinuous NER task. However, a model which is effective in a general NER task may show worse performance in a subject recognition task because of the difference between them, so it is difficult to employ these models directly in subject recognition. In addition, there is no model which can achieve an accuracy of 100%, so there are always some questions where a general NER model fails.

In practical applications, a NLP model is often required to answer noisy and abnormal questions caused by various reasons (e.g., noise in the processes of transmission, transformation, or translation). Sometimes the input to a NLP system is even transformed from a piece of voice, video, or image. If the raw voice, video, or image is available, we could feed it directly into special models such as VL-BERT [30], LXMERT [31], VideoBERT [32], ClipBERT [33], wav2vec [34], or SpeechBERT [35] to avoid errors caused by transformation. In addition, the structure of the original model could also be improved so that such noise is handled automatically by the model. For example, Yang et al. [36] proposed a robust and structurally aware table-text encoding architecture, TableFormer, where tabular structural biases are incorporated completely through learnable attention biases. Su et al. [37] proposed a pretrained Chinese BERT that is robust to various forms of adversarial attacks such as word perturbation, synonyms, and typos. Liu et al. [38] proposed a robustly optimized bidirectional machine reading comprehension method by incorporating four improvements. Besides, there are also some studies which focus on finding and eliminating noisy labels in datasets so that the model can be trained without noise. For example, Zhu and Michael [39] showed that for text classification tasks with modern NLP models like BERT, over a variety of noise types, existing noise-handling methods do not always improve performance. Ye et al. [40] proposed a general framework named label noise-robust dialogue state tracking to train DST models robustly from noisy labels, instead of further improving the annotation quality. Nguyen and Khatwani [41] studied the impact of instance-dependent noise on the performance of product title classification by comparing a data denoising algorithm and different noise-resistance training algorithms designed to prevent a classifier model from overfitting to noise. However, compared to a RE model [42], a NER model is much more sensitive to noise, and an entity with a wrong character would be matched to a wrong subject. As a result, it is difficult to employ these methods directly in subject recognition to answer QWAS in KBQA.

As the golden subject is not even contained in a QWAS, it is impossible for a general NER model to recognize it correctly. To answer QWAS, one feasible strategy is to correct possible errors in the input with a spelling error correction model [43–45]. Another strategy is to feed the raw data (e.g., voice) into a multimodal model to avoid errors in transformation [35]. Some researchers [46] also study NER under a noisy labeled setting with calibrated confidence estimation. However, it is usually impractical to ensure that there are no errors in all input questions. In addition, even if a question contains no errors, it could still be a QWAS because the "correct" entity in it may be unmatched to all subjects in the KB. As a result, it is necessary to propose an effective model to recognize subjects when general models fail to recognize matched subjects.

3. Approach

3.1. Overview

A KB, such as Freebase [3], contains three components: a set of entities E, a set of relations R, and a set of facts F, where each fact is a subject-relation-object tuple (s, r, o) with s, o ∈ E and r ∈ R. To answer a single-relation question, a best-matched subject s* and a best-matched relation r* are found by a model so that the predicted object can be retrieved from F as the answer. To answer a complex question, several candidates (paths, subgraphs, SPARQL statements, etc.), which generally consist of a subject and several transition relations and entities, are generated and scored to find the best-matched one and retrieve the answer.
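To make the setting concrete, the following minimal Python sketch (our illustration, not the paper's code; the entity and relation names are invented) stores a toy KB as a set of subject-relation-object facts and retrieves answers given a subject and a relation.

# Minimal sketch of a KB as subject-relation-object facts (toy data, hypothetical relation names).
from collections import defaultdict

facts = {
    ("shakespeare (person)", "people/person/works_written", "the sonnets"),
    ("shakespeare (person)", "people/person/place_of_birth", "stratford-upon-avon"),
    ("shakespeare (book)", "book/written_work/author", "bill bryson"),
}

# Index facts by (subject, relation) so that answer retrieval is a simple lookup.
index = defaultdict(set)
for s, r, o in facts:
    index[(s, r)].add(o)

def retrieve_answers(subject, relation):
    """Return all objects o such that (subject, relation, o) is a fact in the KB."""
    return index[(subject, relation)]

print(retrieve_answers("shakespeare (person)", "people/person/works_written"))  # {'the sonnets'}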

Since there can be millions of entities and thousands of relations in a KB, it is usually difficult and inefficient for a model to find s* and r* directly. In general, a NER model is employed to select several successive tokens in the input question as the recognized entity e. Then, RE or other models are employed to generate relation candidates based on e and to find the best-matched one.

However, in the NER model, the entity e is selected from tokens in the input question instead of from E, so it is possible that no entity is recognized or that e cannot be matched to any subject in the KB. In this case, no candidates would be generated and no answer would be found. In this paper, MGPM is proposed to answer such questions.

The overall structure of our model is shown in Figure 2. The relation set R and the entity set E are generated based on all possible relations and subjects in the KB. Then, R is pruned by Pruning Model I to generate R_D (relations with well-matched domains), and R_D is further pruned by Pruning Model II to generate R_W (well-matched relations). Then, E is pruned based on the relation constraints (domain and tuple) of R_W to generate the subject candidate set S_C. Especially, R_W is a subset of R_D, so the subject set pruned by Pruning Model II is also a subset of the subject set pruned by Pruning Model I. As a result, the pruning process by Pruning Model I is shown as a dashed line and the subject set generated by it is not shown in the figure. S_C is further pruned by Pruning Model III (based on the string similarity constraint) to generate the similar subject set S_sim. S_sim is further pruned by Pruning Model IV (based on the semantic similarity constraint) to get the set containing only the best-matched subject s*. Thus, a general KBQA model can be employed to find the answer based on s*. In addition, we can simply exclude some of these submodels by setting the corresponding parameters (thresholds) to 0. However, experimental results in Section 4.4 show that the whole model with all 4 submodels achieves the best performance.

Figure 3 shows the pruning process of MGPM with an example question "what kind of release is the best of cinema music?." Subject candidates are pruned in the following steps:
(i) Sets of relations, facts, and subjects are generated based on the KB.
(ii) Pruning Model I is employed to score each relation domain, and domains with scores below a threshold θ_1 (determined in our experiments) are eliminated (gray background area). In this example, the domain "book" is eliminated.
(iii) Pruning Model II is employed to score each relation tuple, and tuples with scores below a threshold θ_2 (determined in our experiments) are eliminated (red background area). In this example, the relation tuple "music/album/artist" is eliminated.
(iv) The set of subjects is pruned based on the relation constraints in the set of facts. Subjects which cannot satisfy the constraints are eliminated (green background area). In this example, the subject "the" is eliminated. Especially, subjects which satisfy the constraint of relation tuple can certainly satisfy the constraint of relation domain (the dashed line in the figure).
(v) Pruning Model III is employed to score each subject, and subjects with scores below a threshold θ_3 (determined in our experiments) are eliminated (blue background area). In this example, the subject "beginnings" is eliminated.
(vi) Pruning Model IV is employed to score each subject again, and all subjects except the one with the top score are eliminated (yellow background area). In this example, the subject "best" is eliminated.
(vii) Only one subject remains in the set, which is output as the best-matched subject of the model. In this example, the best-matched subject "the best of cinema music" is output, which is also the golden subject of the input question.

In general, there are two main differences between our method and subject recognition with a general NER model:
(i) Search space: the search space of a general NER model contains successive words in the question, whereas the search space of our method contains subjects in the KB.
(ii) Matching strategy: in a general NER model, the recognized entity is matched to each subject in the KB and the identical one is considered as the matched subject. In our method, the subject with the highest score (calculated by our model) is considered as the matched subject.

In addition, the order of these submodels in MGPM has no influence on the final result, because s* is the subject with the highest score in Pruning Model IV and it should not be eliminated by any of the other submodels. However, different orders lead to different time costs, and the order in the figure leads to the lowest time cost because it requires the fewest calculations in total. For example, for the question "what is the name of a location in brasília standard time," if Pruning Model IV were employed first, there would be 3,972k candidates to be scored by the model. After Pruning Model III (which is not a neural network model, so its time cost can be ignored) is employed, 287k candidates remain. After Pruning Model I, which has 90 domain candidates, is employed, 2.5k candidates remain. After Pruning Model II, which has 25 relation candidates matched to the well-matched domain "time," is employed, only 75 candidates remain.

3.2. Pruning Model I: Relation Domain Constraint

As there are millions of entities in the KB, it is quite difficult to find the best-matched subject s* in E directly. If the golden relation of a single-relation question (or the golden first-hop relation of a complex question) is known, the set of entities E can be pruned based on the relation constraint, and the pruned set of subject candidates S_C would be generated. Then, it is much easier to find s* in S_C.

In addition, in some KBs such as Freebase, relations are organized in a hierarchical structure "domain/type/topic." Compared to thousands of relations, there are far fewer domains, and each of them matches only hundreds of relations. Therefore, a set of domains D is generated, and then the best-matched domain and the best-matched relation would be found. However, such domains and relations are often not the golden ones because of the large number of candidates and the error transfer. As a result, Pruning Model I is employed and a set of well-matched domains D_W is generated, which prunes R to a set of relations with well-matched domains R_D. Then, Pruning Model II is employed to prune R_D to a set of well-matched relations R_W. For complex questions, although there could be multiple golden relations, the golden first-hop relation is still better matched to the question than wrong relations, so it is most probably contained in R_W. As a result, this method is also effective for complex questions.

In Pruning Model I, for a question q (Token I), we generate question-domain pairs (q, d_i), d_i ∈ D, as Token II. Then, we feed them into a BERT-based classification model and get a prediction set P:

P = {(p_{i,0}, p_{i,1}) | d_i ∈ D}.

In the equation, p_{i,0} is the probability that pair (q, d_i) belongs to Class 0 (unmatched) and p_{i,1} is the probability that this pair belongs to Class 1 (matched). Then, a set of well-matched domains is generated as follows:

D_W = {d_i ∈ D | p_{i,1} > θ_1}.

In the equation, θ_1 is a threshold value to decide whether a domain is well-matched, which is set by our experiments. In the case that D_W = ∅, we set

D_W = {argmax_{d_i ∈ D} p_{i,1}}.

In this case, the set of well-matched domains only contains a unique best-matched domain.
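As an illustration only (the paper implements its classifiers with bert4keras; here we assume an equivalent fine-tuned checkpoint saved at the hypothetical path "pm1-bert"), the following Python sketch shows how Pruning Model I can score question-domain pairs and apply the threshold θ_1 with the argmax fallback. Pruning Models II and IV follow the same pattern, with three output classes in Model II.

# Sketch of Pruning Model I: score (question, domain) pairs with a BERT pair classifier,
# keep domains with P(matched) > theta_1, and fall back to the top-scoring domain otherwise.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("pm1-bert")   # hypothetical fine-tuned checkpoint
model = AutoModelForSequenceClassification.from_pretrained("pm1-bert", num_labels=2)
model.eval()

def prune_domains(question, domains, theta_1):
    with torch.no_grad():
        enc = tokenizer([question] * len(domains), domains,
                        padding=True, truncation=True, return_tensors="pt")
        probs = torch.softmax(model(**enc).logits, dim=-1)[:, 1]   # P(Class 1: matched)
    kept = [d for d, p in zip(domains, probs.tolist()) if p > theta_1]
    return kept if kept else [domains[int(probs.argmax())]]        # argmax fallback when empty

# Example usage with toy Freebase-style domains:
# prune_domains("what kind of release is the best of cinema music?",
#               ["music", "book", "time"], theta_1=0.5)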

3.3. Pruning Model II: Relation Tuple Constraint

After D_W is generated, R can be pruned to a set of relations with such domains, R_D. In Pruning Model II, R_D will be further pruned. For a question q (Token I), we generate question-relation pairs (q, r_j), r_j ∈ R_D, as Token II and get a prediction set P:

P = {(p_{j,0}, p_{j,1}, p_{j,2}) | r_j ∈ R_D}.

In the equation, p_{j,0} is the probability that pair (q, r_j) belongs to Class 0 (unmatched), p_{j,1} is the probability that this pair belongs to Class 1 (matched), and p_{j,2} is the probability that this pair belongs to Class 2 (related). A pair belonging to Class 2 means that the relation is unmatched to the question but matched to the subject. Experimental results in Section 4.4 show the effectiveness of this strategy.

Then, a set of well-matched relations is generated:

R_W = {r_j ∈ R_D | p_{j,1} > θ_2}.

In the equation, θ_2 is a threshold value to decide whether a relation is well-matched, which is set by our experiments. In the case that R_W = ∅, we set

R_W = {argmax_{r_j ∈ R_D} p_{j,1}}.

In this case, the set of well-matched relations only contains a unique best-matched relation. Especially, if θ_2 is set high enough that no relation exceeds it, only the unique best-matched relation would be found for each question.

3.4. Pruning Model III: String Similarity Constraint

Based on the relation constraint of R_W, the entity set E in the KB is pruned to a set of subject candidates S_C. However, it is still difficult and inefficient to recognize the best-matched subject from them. As a result, Pruning Model III is employed to prune S_C further based on string similarity.

For the SimpleQuestions and WebQuestionsSP datasets, the golden subject is mentioned in the question, so the proposition "entity e is identical to some successive words in the question" is a necessary but not sufficient condition for the proposition "entity e is the golden subject of the question." For a question, there may be multiple entities in the KB which are identical to some successive words in the question, but only one of them is the golden subject. For QWAS, the golden subject is probably similar in string to the abnormal subject mention, since errors have only a limited impact: it is impractical and meaningless to change most characters of a subject.

The Levenshtein algorithm [47] is a common way to calculate the similarity between two strings. In the algorithm, the Levenshtein ratio L_r is calculated by the following equation:

L_r = (l_sum − l_dist) / l_sum.

In the equation, l_sum is the total length of the two strings and l_dist is the edit distance between them. L_r is positively related to the similarity between two strings, and L_r for two identical strings is 1.

However, L_r for a subject and a question is nonsensical, because the subject is generally a part of the question. Therefore, the original Levenshtein algorithm is modified:

L_avg(S, Q) = (1/m) Σ_{i=1}^{m} max_{1≤j≤n} L_r(s_i, t_j).

In the equation, S = {s_1, ..., s_m} is the set of tokens of a subject (m tokens in total) and Q = {t_1, ..., t_n} is the set of tokens of a question (n tokens in total). To mitigate interference from ineffective information, we eliminate tokens with symbols, numbers, and repetitive words. In this way, for each token s_i in a subject, the token with the maximum similarity (maximum Levenshtein ratio) in the question is found. Then, the average Levenshtein ratio L_avg is calculated to evaluate the similarity between a subject and the most similar words in a question. Obviously, L_avg for a normal question and its golden subject is 1.

Then, a set of similar subjects is generated:

S_sim = {s ∈ S_C | L_avg(s, q) > θ_3}.

In the equation, θ_3 is a threshold value to decide whether a subject is similar, which is set by our experiments.
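The following pure-Python sketch (our code, following the formulas reconstructed above rather than the authors' implementation) computes the token-wise average Levenshtein ratio and can be used to filter subject candidates with θ_3.

# Sketch of Pruning Model III: average, token-wise Levenshtein ratio between subject and question.
import re

def levenshtein_distance(a, b):
    """Classic edit distance with unit insert/delete/substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def levenshtein_ratio(a, b):
    """L_r = (l_sum - l_dist) / l_sum; equals 1.0 for identical strings."""
    total = len(a) + len(b)
    return (total - levenshtein_distance(a, b)) / total if total else 1.0

def tokens(text):
    """Keep alphabetic tokens only and drop repetitions (symbols, numbers, duplicates removed)."""
    seen, out = set(), []
    for t in re.findall(r"[^\W\d_]+", text.lower()):
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

def avg_levenshtein_ratio(subject, question):
    """L_avg: for each subject token, take its best-matching question token and average the ratios."""
    s_toks, q_toks = tokens(subject), tokens(question)
    if not s_toks or not q_toks:
        return 0.0
    return sum(max(levenshtein_ratio(s, t) for t in q_toks) for s in s_toks) / len(s_toks)

# avg_levenshtein_ratio("brasília time zone",
#                       "what is the name of a location in brasília standard time")
# -> a relatively high score, although the subject is not an exact n-gram of the question

A candidate s is then kept in S_sim whenever avg_levenshtein_ratio(s, q) > θ_3.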

3.5. Pruning Model IV: Semantics Similarity Constraint

In previous submodels, a set of similar subjects S_sim is generated. Then, Pruning Model IV, which is also based on a BERT-based classification model, is employed to find the best-matched subject based on semantic similarity. In this model, for a question q (Token I), we generate question-subject pairs (q, s_k), s_k ∈ S_sim, as Token II and get a prediction set P:

P = {(p_{k,0}, p_{k,1}) | s_k ∈ S_sim}.

In the equation, p_{k,0} is the probability that pair (q, s_k) belongs to Class 0 (unmatched) and p_{k,1} is the probability that this pair belongs to Class 1 (matched). Then, we get the best-matched subject (set) by the following equation:

s* = argmax_{s_k ∈ S_sim} p_{k,1}.

Then, the answer can be found by a general KBQA model based on s*.

3.6. Combination of Submodels

In MGPM, the aforementioned submodels are combined into a pipeline, and the core algorithm of the whole model is shown in Algorithm 1.

Input: Question q, Entity set E, Relation set R, Domain set D, Fact set F
Output: Recognized subject s*
(1) Initialize D_W ← ∅, R_D ← ∅, R_W ← ∅, S_C ← ∅, S_sim ← ∅
(2) for d in D do
(3)   Calculate the score p_d of (q, d) by Pruning Model I
(4)   if p_d > θ_1 then D_W ← D_W ∪ {d}
(5) end for
(6) for r in R do
(7)   if domain(r) in D_W then R_D ← R_D ∪ {r}
(8) end for
(9) for r in R_D do
(10)   Calculate the score p_r of (q, r) by Pruning Model II
(11)   if p_r > θ_2 then R_W ← R_W ∪ {r}
(12) end for
(13) for (s, r, o) in F do
(14)   if r in R_W and s not in S_C then S_C ← S_C ∪ {s}
(15) end for
(16) for s in S_C do
(17)   Calculate L_avg(s, q) by Pruning Model III
(18)   if L_avg(s, q) > θ_3 then S_sim ← S_sim ∪ {s}
(19) end for
(20) for s in S_sim do
(21)   Calculate the score p_s of (q, s) by Pruning Model IV
(22)   Record p_s
(23) end for
(24) s* ← argmax_{s ∈ S_sim} p_s
(25) return s*

In our model, a question q is first fed to Pruning Model I, and D (generated from the KB) is pruned based on the score (calculated by Pruning Model I) of each domain to generate D_W. Then, R (generated from the KB) is pruned to R_D by selecting all relations whose domain is in D_W. Pruning Model II is then employed to calculate the score of each relation in R_D, and R_D is pruned to R_W based on the score. Based on F, E is pruned to S_C by selecting all entities which belong to a fact (in F) that contains a relation in R_W. Pruning Model III is then employed to calculate the score of each subject in S_C, and S_C is pruned to S_sim based on the score. Finally, Pruning Model IV is employed to calculate the score of each subject in S_sim, and the subject with the highest score is output as the recognized subject s*.
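The following compact Python sketch summarizes this combination (our illustration; score_domain, score_relation, and score_subject are placeholder callables standing in for Pruning Models I, II, and IV, avg_levenshtein_ratio is the string-similarity function of Pruning Model III, and relations are assumed to follow the Freebase "domain/type/topic" format).

# End-to-end sketch of Algorithm 1 with placeholder scoring functions.
def mgpm_recognize_subject(q, D, R, F, score_domain, score_relation, score_subject,
                           avg_levenshtein_ratio, theta_1, theta_2, theta_3):
    # Pruning Model I: keep well-matched domains (argmax fallback when none passes).
    D_W = ({d for d in D if score_domain(q, d) > theta_1}
           or {max(D, key=lambda d: score_domain(q, d))})
    # Relation domain constraint.
    R_D = [r for r in R if r.split("/")[0] in D_W]
    # Pruning Model II: keep well-matched relations (same fallback).
    R_W = ({r for r in R_D if score_relation(q, r) > theta_2}
           or {max(R_D, key=lambda r: score_relation(q, r))})
    # Relation tuple constraint: subjects appearing in a fact with a kept relation.
    S_C = {s for (s, r, o) in F if r in R_W}
    # Pruning Model III: string similarity constraint.
    S_sim = {s for s in S_C if avg_levenshtein_ratio(s, q) > theta_3}
    # Pruning Model IV: output the subject with the highest semantic-matching score.
    return max(S_sim, key=lambda s: score_subject(q, s)) if S_sim else None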

4. Experiments

4.1. Dataset

The SimpleQuestions dataset [48] is a KBQA dataset of single-relation questions. It provides 108,442 single-relation questions with their answer facts, which are paired with subject-relation-object tuples from Freebase. The dataset is split into a training set, a validation set, and a test set with 75,910, 10,845, and 21,687 question-fact pairs, respectively. Among all pairs in the test set, there are 1,385 QWAS whose golden subjects are not included in FB5M (a subset of Freebase) or cannot be strictly matched to any n-gram generated from the question [1]. To evaluate the adaptability of our model to normal questions and QWAS, we divide the test set into Dataset I, which contains 20,302 pairs of normal questions, and Dataset II, which contains 1,385 pairs of QWAS.

The WebQuestionsSP dataset [49] is a KBQA dataset of complex questions (also based on Freebase), which contains 3,098 samples in the training set and 1,639 samples in the test set. We also divide the test set into Dataset III which contains 1,233 samples of normal questions and Dataset IV which contains 406 samples of QWAS.

The WebQuestions dataset [50] is a KBQA dataset of the mixture of single-relation and complex questions (also based on Freebase), which contains 3,778 samples in the training set and 2,032 samples in the test set. It is selected to evaluate the performance of our model for mixed types of questions.

4.2. Experiment Setting

Our model is based on the BERT-base model where the number of transformer blocks is 12, the hidden size is 768, and the number of self-attention heads is 12. For each BERT-based classification model in this paper, parameters are trained by an Adam optimizer [51] with a learning rate of 5e-5, a loss function of sparse categorical crossentropy, an activation function of tanh, and a batch size of 64. In addition, a dropout layer with a 0.1 dropout rate and a SoftMax layer with 2 units (3 units in Pruning Model II) are appended to prevent overfitting and output classification results, respectively.
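For illustration, the head of each pair classifier can be sketched as follows in PyTorch (the paper uses bert4keras; the hyperparameters follow this section, while the module structure is our assumption).

# Sketch of a BERT-based pair classifier head with the hyperparameters listed above.
import torch
import torch.nn as nn
from transformers import BertModel

class PairClassifier(nn.Module):
    def __init__(self, num_classes=2):                    # 3 classes for Pruning Model II
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # 12 layers, hidden 768, 12 heads
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        pooled = out.pooler_output                         # tanh-activated [CLS] representation
        return self.classifier(self.dropout(pooled))       # logits; softmax applied in the loss

model = PairClassifier(num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
loss_fn = nn.CrossEntropyLoss()   # counterpart of sparse categorical cross-entropy
# The training loop iterates over (question, candidate, label) pairs with batch size 64 for 3 epochs.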

Each of the submodels (except Pruning Model III) in our MGPM is trained independently on the training set of SimpleQuestions, and the whole MGPM is evaluated on the test sets of all datasets. During the training of Pruning Models I and IV, for each question, we generate 1 positive sample (the golden domain or subject) with label 1 and at most 5 negative samples (randomly selected from all candidates) with label 0. In Pruning Model II, for each question, we generate 1 sample (the golden relation) with label 1, at most 3 samples (randomly selected from relations unmatched to the golden subject) with label 0, and at most 3 samples (randomly selected from relations matched to the golden subject) with label 2. All models are trained for 3 epochs (approximately 40 minutes per epoch) on a computer with an AMD R9-5950X CPU and a GeForce RTX 3090 GPU.
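The sample generation described above can be sketched as follows (our code; the argument names are hypothetical).

# Sketch of training-sample generation for the submodels.
import random

def make_binary_samples(question, golden, candidates, k_neg=5):
    """Pruning Models I and IV: 1 positive (label 1) + up to k_neg random negatives (label 0)."""
    pool = [c for c in candidates if c != golden]
    negatives = random.sample(pool, k=min(k_neg, len(pool)))
    return [(question, golden, 1)] + [(question, c, 0) for c in negatives]

def make_pm2_samples(question, golden_relation, subject_relations, all_relations, k=3):
    """Pruning Model II: 1 matched relation (label 1), up to k relations unmatched to the
    golden subject (label 0), and up to k relations matched to the golden subject (label 2)."""
    related = [r for r in subject_relations if r != golden_relation]
    unrelated = [r for r in all_relations if r not in subject_relations]
    samples = [(question, golden_relation, 1)]
    samples += [(question, r, 0) for r in random.sample(unrelated, k=min(k, len(unrelated)))]
    samples += [(question, r, 2) for r in random.sample(related, k=min(k, len(related)))]
    return samples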

In general, we choose BERT-CRF (one of the most popular fine-tuned models in the NLP field), implemented with bert4keras (https://github.com/bojone/bert4keras), EGP [52] (proposed in 2022, one of the latest fine-tuned models for NER tasks), and several models without pretrained models (e.g., CFO, BiLSTM-CRF, etc.) as comparisons for our proposed model. For datasets, we choose SimpleQuestions (a widely used dataset of single-relation questions), WebQuestionsSP (a widely used dataset of complex questions), several subsets of them (Datasets I–V), and WebQuestions (a dataset of mixed single-relation and complex questions).

It may seem that countless other NER methods and datasets could serve as comparisons for our model; however, many of them are incompatible with it, for the following reasons:
(i) Because of the difference between a NER task and a SR task under KBQA (introduced in our introduction), many NER models would output all possible entities rather than the single subject entity of an input question, which is effective for a NER task but inapplicable for a SR task.
(ii) Although there are some NER models (later than BERT) which outperform BERT in NER tasks and could also be employed in a SR task (e.g., ELECTRA, T5, etc.), they fail to outperform BERT in our experiments (not shown in our paper). Even EGP could not outperform BERT on all datasets. In fact, BERT-CRF has remained the default model for the SR task under KBQA in many studies up to now, and it is significant that our model outperforms BERT, EGP, and many other models in our experiments.
(iii) SimpleQuestions, WebQuestionsSP, and WebQuestions are widely used open-domain KBQA datasets, which have served as evaluators of general KBQA models up to now. Comparing different models on these datasets has been a common way to evaluate KBQA models. Just like many studies of general KBQA, we also select these datasets to evaluate our model (in fact, even the latest research [53] also chose SimpleQuestions as an evaluator).
(iv) In addition, catastrophic forgetting [54] is a feature of neural network models, especially pretrained models. As a result, the best versatility and the best performance are usually incompatible in many tasks. For example, the latest research [53] proposed a model with high versatility in multilingual QA tasks, whereas it fails to outperform even a traditional model [55] if we only focus on performance on the SimpleQuestions dataset. As a result, we only select models which focus on the best performance on specific types of datasets as comparisons for our model, and experimental results show that our model achieves better performance, which shows its effectiveness. In addition, our model also shows versatility in the complex KBQA task (the WebQuestionsSP dataset), where the compared models are inapplicable, which further demonstrates its effectiveness.

4.3. Experiment Results

Accuracy, recall, precision, and F1 score are all common indicators in deep learning. In the KBQA task, as the search space for each input question is usually different, among other reasons, accuracy is usually selected as the indicator in many studies. In this paper, we also select accuracy as the indicator to evaluate our model. Experimental results for subject recognition are shown in Table 1. In Table 1, MGPM means that our model is employed as a standalone model, and it achieves an accuracy of 89.5% on SimpleQuestions (SQ), 52.8% on WebQuestionsSP (WSP), and 52.1% on WebQuestions (WQ), which shows that the SR task for complex questions is more difficult than that for single-relation questions. For normal questions, MGPM achieves an accuracy of 92.4% on Dataset I and 61.6% on Dataset III, both of which exceed the accuracy on the original datasets. It shows that if we could ensure that all input questions are normal questions in practice, a KBQA model would achieve better performance. For QWAS, MGPM achieves an accuracy of 46.0% on Dataset II and 26.1% on Dataset IV, which are both much lower than the accuracy on normal questions. It shows that the SR task for normal questions is much easier than that for QWAS.

BERT-CRF is one of the most popular models in NER tasks, which can also be employed in SR tasks after fine-tuning. It achieves an accuracy of 90.8% on SimpleQuestions, 60.0% on WebQuestionsSP, and 58.4% on WebQuestions, which outperform MGPM (by margins of 1.3%, 7.2%, and 6.3%). For normal questions, it achieves an accuracy of 97.0% on Dataset I and 79.8% on Dataset III, which further outperform MGPM (by margins of 4.6% and 18.2%). However, it is inapplicable to QWAS and cannot answer any questions on Datasets II and IV.

Fortunately, MGPM can work not only as a standalone model but also as a plug-in approach to another KBQA model. "+ MGPM" in Table 1 means that a question is answered by a traditional KBQA model (e.g., BERT-CRF) first, and in the case that no answer can be found (no subject is matched or recognized), MGPM is employed to answer the question as the alternative model. As an integrated model, BERT-CRF + MGPM achieves an accuracy of 94.4% on SimpleQuestions, 68.6% on WebQuestionsSP, and 63.7% on WebQuestions, all of which outperform BERT-CRF (by margins of 3.6%, 8.6%, and 5.3%). For normal questions, it achieves an accuracy of 98.4% on Dataset I and 87.3% on Dataset III, which also outperform BERT-CRF (by margins of 1.4% and 7.5%). For QWAS, it achieves an accuracy of 36.0% on Dataset II and 11.8% on Dataset IV, which both outperform BERT-CRF but fail to outperform standalone MGPM, because BERT-CRF outputs wrong answers to some QWAS and MGPM is not employed for these questions.
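The integration logic itself is simple; a minimal sketch (ours, with placeholder callables rather than the actual systems) is:

# Sketch of the "+ MGPM" fallback: the baseline answers first, MGPM handles its failures.
def integrated_answer(question, baseline_answer, mgpm_answer):
    """baseline_answer and mgpm_answer are callables returning an answer or None when they fail."""
    answer = baseline_answer(question)        # e.g., BERT-CRF subject recognition + RE
    if answer is None:                        # no subject recognized or matched in the KB
        answer = mgpm_answer(question)        # fall back to the multigranularity pruning model
    return answer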

EGP is another, recently proposed NER model which can also be employed in SR tasks. Experimental results show that it achieves better performance than BERT-CRF on complex questions but worse performance on single-relation questions. The integrated model EGP + MGPM outperforms EGP on all datasets and also outperforms BERT-CRF on WebQuestionsSP and WebQuestions.

In general, as a standalone model, MGPM achieves the highest accuracies of 46.0% and 26.1% on the datasets of QWAS (II and IV). However, on the datasets of normal questions (I and III), MGPM fails to outperform the baseline models. As a result, for the original datasets, it is a better strategy to integrate MGPM and a baseline model. In this strategy, a question is answered by a baseline model first, and in the case that no answer can be found (no subject is matched or recognized), MGPM is employed to answer the question as the alternative model. Experimental results show that BERT-CRF + MGPM outperforms the baseline BERT-CRF by margins of 3.6%, 8.6%, and 5.3% on the whole SimpleQuestions, WebQuestionsSP, and WebQuestions datasets, and EGP + MGPM outperforms the baseline EGP by margins of 4.7%, 8.1%, and 4.8% on these datasets.

In practice, it is usually unknown whether the subject in an input question is abnormal, and sometimes a wrong-matched subject is found by the general model. As a result, the performance of the integrated MGPM model is worse than that of standalone MGPM on Datasets II and IV. If the quality of input questions can be evaluated, it is better to employ the standalone MGPM for questions of poor quality (e.g., translated from other languages).

In addition, Table 1 also shows that different models achieve different performance on different datasets: Efficient GlobalPointer outperforms BERT-CRF on SimpleQuestions while BERT-CRF outperforms Efficient GlobalPointer on WebQuestionsSP. However, no matter which baseline model is chosen, the integrated MGPM model would always outperform the corresponding baseline model on a whole dataset. In other words, our MGPM is compatible with various KBQA models.

Then, a general BERT-based RE model can be employed for relation prediction for single-relation questions in a KBQA task. Experimental results for the overall accuracies (%) of our models and traditional models on SimpleQuestions are shown in Table 2. Among these methods, KEQA [56] and M3M [53] are recent methods which show satisfactory performance in multilingual KBQA or on other KBs, whereas they fail to outperform the traditional method BiLSTM-CRF + BiLSTM [55] on the SimpleQuestions dataset. We choose BERT-CRF implemented with bert4keras as the baseline model, which is widely employed in various KBQA tasks and achieves good performance, and the integrated MGPM model further outperforms it by a margin of 3.4%, showing the effectiveness of our models.

For complex questions, relation prediction and subject recognition are usually considered as two individual tasks, and rather than the overall accuracy, the accuracy for relation prediction is more often chosen to evaluate a model to answer complex questions. As a result, it is unnecessary to conduct additional experiments to evaluate the overall accuracy on WebQuestionsSP.

To further evaluate the robustness of our model, we delete 5%, 10%, and 15% of the words, respectively, in questions of the SimpleQuestions dataset and evaluate our proposed model (BERT-CRF + MGPM) and the baseline model (BERT-CRF) on these noisy datasets. Experimental results for the accuracy of subject recognition are shown in Figure 4. The baseline model achieves accuracies of 90.8%, 80.4%, 72.0%, and 64.3% on the datasets with deletion rates of 0% (the original dataset), 5%, 10%, and 15%, respectively. Our proposed model achieves accuracies of 94.4%, 84.2%, 76.3%, and 69.2% on these datasets, which outperform the baseline model by margins of 3.6%, 3.8%, 4.3%, and 4.9%, respectively. In general, our proposed model outperforms the baseline model on all noisy datasets, and the margin shows a positive correlation with the deletion rate, which demonstrates the robustness of our proposed model.
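The noise used here is plain word deletion; a sketch of the perturbation is shown below (our code; the exact sampling procedure is not specified in this section).

# Sketch of the word-deletion noise used in the robustness experiment.
import random

def delete_words(question, rate, seed=0):
    """Randomly delete roughly `rate` of the words in a question."""
    rng = random.Random(seed)
    kept = [w for w in question.split() if rng.random() >= rate]
    return " ".join(kept) if kept else question   # never return an empty question

# delete_words("what type of music does david ruffin play", rate=0.10)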

4.4. Parameter Determination and Ablation Experiments

In our model, θ_1, θ_2, and θ_3 are hyperparameters which should be optimized in the experiments. A higher parameter value means stronger pruning, fewer candidates in Pruning Model IV, and less prediction time, so the values should be as high as possible; in the case that several values lead to a similar accuracy, the highest value is selected. As the output of Pruning Model III is the direct input to Pruning Model IV, we first determine θ_3 by gradually decreasing it. As it would take too much time to evaluate all combinations on the whole dataset, we only evaluate them on a dataset of the most significant questions. This dataset (Dataset V) contains all 1,334 questions on SimpleQuestions where the general model finds no answer (no subject is matched or recognized). In the integrated MGPM model, MGPM is employed to answer these questions, so a model which performs well on these questions leads to a well-performing integrated MGPM model on the whole dataset. We evaluate our model with these hyperparameter values on Dataset V, and experimental results in Table 3 show that several values of θ_3 lead to similar accuracies, so the optimized (highest) value of θ_3 is 0.6. Compared with θ_3, θ_1 and θ_2 show less influence on the result; we evaluate several value combinations, and the results show that the combination reported in Table 3 achieves the best accuracy (lower values lead to similar accuracy and are not shown in the table). As a result, we set θ_1, θ_2, and θ_3 to these optimized values in our model.

In Table 3, "Candidate" means the average number of candidates per question in Pruning Model IV, which has a positive correlation with the prediction time. Among the 4 submodels in our model, Pruning Model IV is indispensable because it directly outputs the recognized subject. However, if we only employ Pruning Model IV, there is a huge number of candidates (3,972k), which leads to unacceptable time and space costs. After Pruning Model III is included, the number of candidates is much smaller (109k), but it still leads to unacceptable time and space costs. In these cases, our computer cannot calculate the accuracy. After Pruning Model I is included, the number of candidates is even smaller (7.9k), and the accuracy can be calculated (55.6%), which is lower than that of the whole model. As a result, the combination of all 4 submodels has the best performance, which is the whole MGPM.

In Pruning Model II, question-relation pairs are classified into three categories (Strategy I) instead of two categories (Strategy II). For single-relation questions, after the subject is found by MGPM, we prefer to find the relation with a general RE model (Strategy III) rather than with Pruning Model II in MGPM (Strategy IV). As shown in Table 4, experimental results show that Strategy I outperforms Strategy II by a margin of 1.4% and Strategy III outperforms Strategy IV by a margin of 10.9%. As a result, we choose Strategy I and Strategy III in our MGPM.

5. Discussion

In our experiments, we choose KBQA models based on BERT-base as baseline models, and the experimental results show the effectiveness of our MGPM and the integrated models. However, our model is not confined to BERT-base; it could also be integrated with other pretrained models (e.g., BERT-Large, ELECTRA) or with methods that do not use pretrained models. As long as there are some questions where a KBQA model fails to find answers, our model can be employed to answer them efficiently. As a result, our model adapts well to integration with various KBQA models, and integrated MGPM models would probably outperform the original models.

In practice, an input question to a KBQA system could contain abnormal expressions from various users. Besides, it could be fed to the system after multiple processes of transmission, transformation, or translation. As a result, it could be common for a practical KBQA system to have to answer QWAS. Our experiments have shown that MGPM achieves a SR accuracy of 89.5% while the baseline BERT-CRF achieves a SR accuracy of 90.8% on the SimpleQuestions dataset, which contains 6.1% QWAS. Therefore, we could infer that standalone MGPM would outperform the baseline model on a dataset (similar to SimpleQuestions) which contains more than 9.1% QWAS. In these cases, our MGPM has great practical value.
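The 9.1% figure can be checked with a simple break-even calculation, assuming the per-subset accuracies reported above (92.4% and 46.0% for standalone MGPM, 97.0% and 0% for BERT-CRF on normal questions and QWAS, respectively) and a QWAS proportion x:

92.4(1 − x) + 46.0x = 97.0(1 − x) + 0·x  ⟹  50.6x = 4.6  ⟹  x ≈ 9.1%.

Standalone MGPM is therefore expected to be ahead whenever the proportion of QWAS exceeds roughly 9.1%.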

However, in some particular cases such as medical QA, it is risky to output an uncertain answer. Users may even prefer no answer to an unreliable one. In these cases, "No Answer" is the safe and acceptable output for a QWAS or an imprecise question, so MGPM is inapplicable. Besides, in the case that the time and space cost is strictly restricted, the integrated MGPM (and standalone MGPM, general KBQA models, or even deep learning models in general) is also inappropriate, because deep learning itself requires higher time and space costs than traditional methods. Instead, traditional methods such as semantic parsing-based models or query models would be employed.

In fact, integrated MGPM, standalone MGPM, and traditional KBQA models have their own advantages and disadvantages: integrated MGPM is model-agnostic and compatible with various KBQA models; standalone MGPM shows better performance in answering QWAS; and sometimes we can find a traditional KBQA model which meets all requirements of a specific task. In summary, there are some strategies for choosing a model in different situations:
(i) In most situations, especially those where a KBQA model (known or unknown) has already been employed, integrated MGPM is a better choice, as it is compatible with various KBQA models.
(ii) In situations where QWAS are frequently input or the quality of the input is difficult to guarantee, standalone MGPM should be chosen, as it shows better performance in answering QWAS.
(iii) In situations where no answer is more acceptable than an unreliable answer, or where the time and space cost is strictly restricted, MGPM is not so appropriate. Instead, traditional KBQA models should be chosen.

6. Conclusion

Among all questions in the original SimpleQuestions and WebQuestionsSP datasets, there are mainly normal questions (Datasets I and III) and some QWAS (Datasets II and IV). A traditional KBQA model (e.g., BERT-CRF) is effective for most normal questions (an accuracy of 97.0% on Dataset I and 79.8% on Dataset III), but it is inapplicable to QWAS (an accuracy of 0% on Datasets II and IV). In most cases, it is difficult to recognize QWAS among all input questions, and even if QWAS could be recognized, a traditional KBQA model still cannot answer them. As a result, in practical applications, a KBQA model is simply employed; it can find answers (right or wrong) to some of the input questions and fails to answer the rest (Dataset V in our experiments).

To improve the performance of a KBQA model, in this paper, we propose a method for the SR task under KBQA when general models fail. In our model, relations in the KB are pruned to a set of well-matched relations by two pruning submodels (I and II). Then, the set of all subjects in the KB is pruned by the constraint of these well-matched relations and by two other pruning submodels (III and IV). After this multigranularity pruning process, the best-matched subject can be recognized. Then, the question can be answered by a general KBQA model based on the recognized subject. In general, our model has the following main advantages:
(i) For normal questions and questions which can be answered by traditional models, our model is also effective and even outperforms traditional models.
(ii) For QWAS and questions where traditional models fail, our model can still answer some of these questions correctly, whereas traditional models achieve an accuracy of 0%.
(iii) After training, our model is effective for different types of questions (single-relation and complex questions) and different datasets (SimpleQuestions, WebQuestionsSP, WebQuestions, and Datasets I–V) without further training, which shows the versatility of our model.

Inevitably, there are also some weaknesses of our model:
(i) In some particular cases where it is risky to output an uncertain answer (e.g., medical QA), our model is inapplicable because it always tries to give an answer to every question.
(ii) In the case that the time and space cost is strictly restricted (e.g., industrial systems), our model is also inappropriate because deep learning requires higher time and space costs than traditional methods. Instead, traditional methods such as semantic parsing-based models or query models would be employed.

As future work, studies will be conducted to mitigate the aforementioned weaknesses of our model and to extend it to multilingual KBQA tasks.

Data Availability

Data used in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the Natural Science Foundation of China under Grant Nos. U21A20491, U1936109, and U1908214.