| | Input initial seed keywords from the literature |
| | Stage 1: BERT word vector similarity selection |
| (1) | Initialize empty similar words vocabulary |
| (2) | For each seed keyword do |
| (3) | Collect corresponding Baidu Baike text |
| (4) | Construct potential keywords vocabulary based on JIEBA segmentation |
| (5) | Vectorize seed keyword and potential keywords in vocabulary with BERTvec |
| (6) | For each keyword in potential keywords vocabulary do |
| (7) | Calculate cosine similarity score between seed keyword vector and potential keyword vector |
| (8) | IF similarity score ≥ threshold then |
| (9) | Add potential keyword to similar words vocabulary |
| (10) | End for |
| (11) | End for |
| (12) | Output similar words vocabulary |
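A minimal sketch of Stage 1 in Python, assuming the HuggingFace `transformers` BERT implementation stands in for BERTvec and the `jieba` package for JIEBA segmentation; the `bert-base-chinese` checkpoint, the mean-pooling of the last hidden layer, and the 0.8 threshold are illustrative assumptions, not settings confirmed by the paper:

```python
import jieba
import numpy as np
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def embed(word: str) -> np.ndarray:
    """Mean-pool the last hidden layer as a word vector (one common choice)."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0).numpy()

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similar_words(seed: str, baike_text: str, threshold: float = 0.8) -> list:
    """Steps (4)-(9): segment the Baidu Baike page, embed the candidates,
    and keep those whose cosine similarity to the seed passes the threshold."""
    candidates = {w for w in jieba.cut(baike_text) if len(w) > 1}  # drop single chars
    seed_vec = embed(seed)
    return [w for w in candidates if cosine(seed_vec, embed(w)) >= threshold]
```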
| | Stage 2: NEZHA word importance selection |
| (13) | Initialize empty similar & important vocabulary |
| (14) | Collect data from CLUE data set in the form of (keywords, text) |
| (15) | Randomly select words from text as pseudo-keywords at a 1:1 ratio to true keywords |
| (16) | Build fine-tuning data set of (keyword/pseudo-keyword, text, label) triples |
| (17) | Construct training set and development set from data set |
| (18) | Fine-tune BERT-TensorFlow, BERT-MindSpore, and NEZHA-MindSpore on the training set |
| (19) | Select the best-performing model (NEZHA-MindSpore) by precision on the development set |
| (20) | For each keyword in similar words vocabulary do |
| (21) | Calculate context importance score with the selected model |
| (22) | Add keyword and its importance score to similar & important vocabulary |
| (23) | End for |
| (24) | Keep words with top 100 importance scores in vocabulary |
| | Output similar & important vocabulary |
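Steps (14)-(16) pair each true keyword with a randomly drawn pseudo-keyword from the same text. A sketch of that data construction, assuming each CLUE record is a (keywords, text) pair; the field handling and sampling details are assumptions:

```python
import random
import jieba

def build_finetune_examples(records):
    """records: iterable of (keywords, text) pairs from the CLUE data set.
    Returns (word, text, label) triples with label 1 for true keywords
    and label 0 for pseudo-keywords sampled at a 1:1 ratio."""
    examples = []
    for keywords, text in records:
        tokens = [t for t in jieba.cut(text) if t.strip()]
        for kw in keywords:
            examples.append((kw, text, 1))  # true keyword
        # Pseudo-keywords: words from the text that are not true keywords.
        pool = [t for t in tokens if t not in keywords]
        for fake in random.sample(pool, min(len(keywords), len(pool))):
            examples.append((fake, text, 0))  # pseudo-keyword
    return examples
```

After fine-tuning on these triples, the selected model's predicted probability that a word is a true keyword for its text can serve as the context importance score of step (21).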
| | Stage 3: LSTM stock index forecast |
| (25) | For each keyword in similar & important vocabulary do |
| (26) | For lagging term in 1 to 10 do |
| (27) | Calculate lagged search index time series |
| (28) | End for |
| (29) | Use Pearson correlation coefficient to select the most related lagged term |
| (30) | End for |
| (31) | Train LSTM to forecast the CSI300 stock index on the 2215-day training set |
| (32) | Calculate and compare model RMSE on the 243-day test set |
| | Output model RMSE |
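A sketch of the Stage 3 lag selection and forecasting setup, assuming `pandas` Series for the daily search index and CSI300 close and a Keras LSTM; the window length, network width, and training details are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from tensorflow import keras

def best_lag(search_index: pd.Series, csi300: pd.Series, max_lag: int = 10) -> int:
    """Steps (26)-(29): shift the search index by 1-10 days and keep the lag
    with the strongest Pearson correlation to the CSI300 series."""
    corrs = {lag: search_index.shift(lag).corr(csi300)  # Pearson by default
             for lag in range(1, max_lag + 1)}
    return max(corrs, key=lambda lag: abs(corrs[lag]))

def make_windows(values: np.ndarray, window: int = 30):
    """Slice a 1-D series into (samples, window, 1) inputs and next-day targets."""
    X = np.stack([values[i:i + window] for i in range(len(values) - window)])
    return X[..., None], values[window:]

# A small LSTM regressor; layer width and window length are illustrative.
model = keras.Sequential([
    keras.layers.LSTM(64, input_shape=(30, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
# After model.fit on the 2215-day training windows, the reported metric is
# RMSE on the 243-day test windows:
# rmse = np.sqrt(np.mean((model.predict(X_test).ravel() - y_test) ** 2))
```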