Abstract
This paper designs a hotel human-computer interaction (HCI) system based on deep learning. The system mainly comprises face detection, speech synthesis, and speech recognition. Real-time face detection is realized by porting the OpenCV library to the Android system and combining it with the AdaBoost algorithm. Local speech recognition is realized with the dedicated speech recognition chip hbr740, while local speech synthesis is realized with the dedicated speech synthesis chip syn6288. Massive speech resources are obtained by connecting over the network to iFLYTEK’s open platform, which provides online speech recognition, semantic understanding, and speech synthesis. In the data flow phase, data are transmitted to the lower computer through the serial port of the tiny4412 in order to realize motion control of the lower computer. This is supported by an SQLite database for system voice interaction and motion control, which is built with the Litepal and POI jar packages. Finally, a hybrid voice interaction system is designed that combines local voice interaction with online voice interaction. Through numerical simulation, the proposed system is tested to verify the feasibility of the hybrid voice interaction scheme. When the network is in good condition, the speech recognition rate of the whole system reaches 94.67%; without the network, the speech recognition rate still reaches 84.67%. These results demonstrate the advantage of the proposed hybrid system.
1. Introduction
Over the past decade, smart hotels have gradually developed, but they are still at an early, exploratory stage. Problems such as high personnel costs, disordered management, and low service quality keep emerging in the traditional hotel industry, so technology iteration and intelligent transformation are inevitable. Artificial intelligence, face recognition, voice interaction, and other technologies have been applied to the hotel industry. Relying on artificial intelligence, the Internet of Things (IoT), big data, edge computing, machine learning, cloud computing, and other emerging technologies, the smart hotel establishes a smart service system that makes its operation, management, and control intelligent and brings users a comfortable, high-quality service experience. When introducing technologies and hardware such as intelligent devices, intelligent robots, and digitalization, it is also necessary to consider consumers’ emotional needs and provide humanized services. The essence of the smart hotel is that it is people-oriented: it focuses on users and combines science and technology with human nature in order to optimize the user experience.
The smart hotel can improve hotel management efficiency, intelligently manage hotel equipment and personnel, and reduce equipment failures and personnel errors. Big data can be used to gather and combine occupancy data, and precision marketing can be used to satisfy users. In addition, the smart hotel supports remote management, statistical data analysis, energy saving, CO2 reduction, low-carbon and ecological protection, and independent monitoring and upgrading of systems and modules, which significantly decreases the cost of hotel operation and maintenance.
Today, the growth of the Internet and mobile devices has opened up new markets for the traditional hotel industry. Booking through travel apps has gradually replaced calling the front desk and booking through a computer network, which is not only convenient for consumers but also drives the growth of the tourism economy. Nevertheless, most hotels in China still rely on the front desk for manual check-in. Room facilities are fixed and slow to be updated, and check-out procedures are cumbersome, so to a large extent these hotels cannot meet the personal needs of customers. In order to improve the user experience, many traditional hotels have gradually started to innovate and to introduce new equipment that draws on science and technology. The traditional hotel industry suffers from high training costs, long working hours, redundant personnel, high staff turnover, low wages, and a weak sense of belonging among staff, all of which result in low work efficiency. Users’ data are scattered and privacy is not guaranteed. The range of room types is small, which makes it difficult to meet personalized needs. Mature large hotel groups have a huge industrial chain and find it difficult to update their infrastructure quickly, resulting in slow transformation. Small and medium-sized hotels, in turn, often only redecorate and upgrade equipment, pursue promotion and marketing excessively, neglect service, and do not invest in design from the users’ perspective, resulting in a poor experience.
The rapid development of embedded technology, the mobile Internet, and artificial intelligence provides the basic software and hardware support for the hotel service robot. The International Federation of Robotics (IFR) defines service robots as fully autonomous or semiautonomous robots that perform useful service activities for human beings but do not engage in production. In China, the definition is relatively narrow: a fully autonomous or semiautonomous robot that provides necessary services for humans and equipment is called a service robot. Service robots can be divided into three categories: (i) entertainment service robots, (ii) family service robots, and (iii) professional service robots. Professional service robots are predominantly used in specific places to complete tasks in place of people, and hotel service robots belong to this category. At present, both the labor cost and the staff turnover rate of the hotel industry are high. According to data from a hotel big-data platform, labor cost accounted for more than 30% in four of the past six years, which makes it a major expenditure item of the hotel. In addition, the average monthly turnover rate of hotel staff over the past six years has been around 4%, which is relatively high [1].
Because of cultural differences between countries, the development of smart hotels has a different emphasis in each. The first unmanned hotel in China, Le Yi’s “unmanned smart hotel,” has realized a fully sensor-equipped intelligent room in which the whole process, from reservation and check-in to check-out, is completed by an electronic system [2]. In view of the pain points of homogenization and high labor costs in traditional hotels, it designed a new “unmanned hotel” model that reduces costs by removing the lobby (“de-lobby”), lowering the staffing ratio, and increasing online publicity. The most notable feature of the smart-home-focused Murano Resort Hotel in Paris, France, is that guests may customize a number of profile settings using the light controller by their bedside; these modes can be adjusted to reflect their own routines and tastes. The Peninsula Hotel in Tokyo, Japan, provides buttons with various purposes throughout the hotel. By pushing the buttons, users may answer the phone hands-free and learn the weather outside and how to dress for it. In the United States, the Seattle Hotel 1000 includes a fully intelligent infrastructure that can accommodate users’ demands whenever they arise. China’s progress in smart hotels lags behind that of developed regions such as Europe and the United States. Several upscale hotels have been researching low-carbon environmental protection and intelligent information systems since 2009. Poor user experiences are particularly typical in small and medium-sized hotels because of limited capital expenditure and a lack of knowledge of the usage scenarios for intelligent systems. The majority of international smart hotels invest heavily in research and development (R&D), create systems that follow their own growth and management philosophy, and significantly enhance the quality of their guests’ stay [3].
The fundamental contributions are as follows:
(1) We design a hotel human-computer interaction system based on deep learning; the proposed system includes face detection, speech synthesis, speech recognition, and so on.
(2) Real-time face detection is realized by porting the OpenCV library to the Android system and combining it with the AdaBoost algorithm in the OpenCV library.
(3) The SQLite database for system voice interaction and motion control is built with the Litepal and POI jar packages.
(4) Combining local voice interaction with online voice interaction, a hybrid voice interaction system is designed and tested to verify the feasibility of the hybrid voice interaction scheme.
The rest of the paper is structured as follows. In Section 2, we discuss related work and state-of-the-art mechanisms. In Section 3, several applications of deep learning in hotel human-computer interaction are discussed. In Section 4, the hotel human-computer interaction system is validated through simulations and tests. Finally, Section 5 concludes the paper and discusses several key directions in which the work can be extended in the future.
2. Related Work
Both at home and abroad, there has been substantial investment in intelligent voice research and development. In 2011, Apple entered the field of intelligent voice with Siri and vigorously promoted the application of Siri in automotive electronics and other fields. In 2014, Google launched the Android Wear project, in which the voice function is an important part [4, 5], and released Google Now and Google Glass. In 2016, Google integrated Google Now and OK Google and launched the voice assistant Google Assistant [6]. In 2014, Microsoft released the Xiaobing robot and Cortana (Xiaona), with leisure entertainment and personal assistant functions respectively, and integrated the two. Domestically, on August 1st, 2012, the China voice industry alliance, jointly initiated by 19 enterprises including iFLYTEK, Huawei, Lenovo, China Mobile, China Unicom, and China Telecom, was officially established in Beijing. At present, the voice technology manufacturers in China fall mainly into three categories. The first is the traditional voice technology manufacturers, such as iFLYTEK, Jietong Huasheng, and Zhongke modular [7, 8]. The second is the current Internet enterprises, such as Tencent, Alibaba, Baidu, and Sogou, which actively promote their original Internet business, obtain voice technology through cooperation or acquisition, and add their own elements and ideas, making current voice technology more and more mature. The third is the small and medium-sized companies that are still in the entrepreneurial stage, such as Yunzhisheng and Sibichi. These companies focus only on their own business areas; at present, most of them work in the general public domain and do not address the dedicated hotel field [9, 10].
Finding and managing feature interactions that violate system requirements and thus cause failures may be difficult. By recasting this issue as a search-based test generation problem, the authors of [11] provide a method to identify feature interaction failures. A hybrid PRNN/KNN algorithm is used to create an improved human voice emotion recognition system [12]. The goal of [13] is to create an articulated language processing method for human-machine interaction-based smart device diagnostics. Although the hybrid CTC/attention ASR system benefits from both the CTC and attention architectures during training and decoding, it remains difficult to use for streaming speech recognition because of its attention mechanism, CTC prefix probability, and bidirectional encoder. The authors of [14] have suggested a truncated CTC prefix probability (T-CTC) to stream its CTC branch and a stable monotonic chunkwise attention (sMoChA) to stream its attention branch. Similarly, [15] offers a guideline for the growth of arti… Convolutional neural network (CNN) and recurrent neural network (RNN) technology, the state of the art in automatic lip reading, is used to build a speech training system for patients with dysphonia and hearing loss [16]. Using noncooperative game theory and robust optimization, the authors of [17] describe a novel coordinated energy management strategy for hybrid AC/DC distribution systems with microgrid clusters that takes various market actors into account. The use of hybrid robust feature extraction approaches for spoken language identification (LID) systems is also presented and encouraged by the researchers in [18]. In order to study speech recognition, the researchers in [19] offer an automatic speech recognition (ASR) system based on single or multiple modalities, including audio and electroencephalogram (EEG) inputs.
The authors of [20] intend to investigate a novel paradigm of human-computer interaction and conduct an in-depth study on the creation and implementation of intelligent services in the big data environment. This is based on the growing idea of “Hybrid Intelligence.”
3. Application of Deep Learning in Hotel Human-Computer Interaction
3.1. Deep Learning
Deep learning is a family of algorithms within machine learning. It is often referred to as deep structured learning or deep machine learning. Today, deep learning is highly successful, not only on the Internet and in artificial intelligence but also in many major fields of daily life, such as state-of-the-art speech recognition and visual object recognition. In this paper, we first introduce some of the concepts involved in deep learning. We then mainly introduce two kinds of deep learning networks, convolutional neural networks and recurrent neural networks, which are the most widely used in image recognition and language processing [21].
Deep learning is now among the most popular techniques, and speech recognition and computer vision are the two fields where it performs best. Convolutional neural networks (CNNs) are a common example in computer vision. They are a specific kind of deep feedforward network that generalizes better than fully connected neural networks and is simpler to train. CNNs are intended to analyze data in the form of multidimensional arrays, such as a color image with three color channels. One-dimensional arrays are used to represent signals and sequences such as speech, two-dimensional arrays are used to represent images, and three-dimensional arrays are used to represent video or volumetric images. A convolutional neural network can be understood by following one color channel of a color image through the network: the network consists of many layers, whose essential functions are convolution, pooling, full connection, and recognition. Figure 1 depicts the deep learning architecture.
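The convolution and pooling operations named above can be illustrated with a minimal sketch in plain Python. The 5 × 5 image, the 2 × 2 kernel, and the function names `conv2d` and `max_pool2d` are assumptions made for this illustration only; they are not taken from the system described in this paper.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, as computed by a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 5x5 single-channel image containing a vertical edge.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# Hypothetical 2x2 kernel that responds to left-to-right intensity increases.
kernel = [
    [-1, 1],
    [-1, 1],
]

features = conv2d(image, kernel)  # 4x4 feature map, peaks at the edge
pooled = max_pool2d(features)     # 2x2 map after 2x2 max pooling
```

In this toy case the feature map is nonzero only in the column where the edge lies, and pooling keeps that response while halving the resolution. A real CNN learns many such kernels per layer and applies them to all color channels.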

The following characterizes the hotel human-computer interaction association problem from four aspects, namely the association scene input data, the association feature input data, the association decision output data, and the loss function, so as to model the association problem as a classification and identification problem. The specific modeling methods are as follows:
(1) Based on the hotel human-computer interaction information at time t, the input data of the hotel human-computer interaction-related scenes are calculated and constructed. The data have an image data structure. The specific calculation method is as follows:
(a) According to the importance of the information obtained from the hotel human-computer interaction, the guest positions within each information source at time t are sorted, and the serial number l of each guest in the source is obtained using the following formulas:
(b) Among the guest information sources of different hotels, calculate the distance between any two guests at time t in order of guest serial number, and take it as the element in row i and column k to build the input data of the human-computer interaction scene of the associated hotel. This element is calculated using the following formula:
(2) For the human-computer interaction point I of guest location information 1 and the human-computer interaction point J of guest location information 2 to be associated, the statistical distances of the two human-computer interaction points with respect to location, information sharing point, and internal information sharing are calculated respectively as the associated feature input data. The specific calculation method is given by the following formula:
Based on the related scene input data, the related feature input data, the related decision output data, and the loss function L obtained from the problem representation above, deep learning is adopted to design the hotel human-computer interaction algorithm. The steps are as follows:
(i) Step 1: first, set the maximum number of hotel human-computer interactions as N; then preprocess the input data of the three associated scenes at time t with peripheral zero padding so that they all have the same size, N × N; finally, merge them into an N × N × 3 tensor.
(ii) Step 2: a deep convolutional neural network is used as an embedding network to obtain the vector representation VC of the associated scene.
(iii) Step 3: a multilayer neural network is used to raise the dimension of the associated feature input data to obtain the high-dimensional representation VD of the associated feature.
(iv) Step 4: combine the hotel-associated scene vector representation VC and the associated feature high-dimensional representation VD and process them with a multilayer neural network. The network outputs the associated decision, and the activation function of the last layer of the neural network is the sigmoid function.
(v) Step 5: using typical datasets and the cross-entropy loss function L, train the neural network built in steps 1–4 to obtain the weights of the neural network.
For steps 2–4, the typical network structure is shown in Figure 2; it includes the hotel scene embedding network, the feature dimension-raising network, and the associated decision network. The network structure can be further adjusted according to the complexity of the association problem and the training effect of the neural network.
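Steps 1, 4, and 5 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the padding is placed on the right and bottom edges (the exact placement of the peripheral zero padding is not specified here), the loss is taken to be binary cross-entropy because the last layer is a sigmoid, and N = 4, the sample matrices, and all function names are hypothetical.

```python
import math


def zero_pad(matrix, n):
    """Pad a matrix with zeros (right/bottom, an assumption) to n x n."""
    padded = [[0.0] * n for _ in range(n)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            padded[r][c] = float(value)
    return padded


def stack_channels(matrices, n):
    """Step 1: merge three padded n x n matrices into an n x n x 3 tensor."""
    padded = [zero_pad(m, n) for m in matrices]
    return [
        [[p[r][c] for p in padded] for c in range(n)]
        for r in range(n)
    ]


def sigmoid(z):
    """Step 4: activation of the last layer."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Step 5: mean binary cross-entropy between labels and sigmoid outputs."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


N = 4  # hypothetical maximum number of interactions
scenes = [
    [[0, 1], [1, 0]],                   # hypothetical 2 x 2 scene matrix
    [[0, 2, 3], [2, 0, 1], [3, 1, 0]],  # hypothetical 3 x 3 scene matrix
    [[0]],                              # hypothetical 1 x 1 scene matrix
]
tensor = stack_channels(scenes, N)      # shape N x N x 3

logits = [2.0, -1.0]                    # hypothetical last-layer pre-activations
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy([1, 0], probs)
```

The embedding and dimension-raising networks of steps 2 and 3 are omitted; in practice they would be built and trained with a deep learning framework, with the tensor above as the scene input.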

3.2. Design of Hardware and Software
The mainstream speech recognition chips on the market include the icroute ld3320, the iFLYTEK xfs5152ce, and the Shanghai Xinfeng micro hbr740. The command word list of the iFLYTEK xfs5152ce cannot be edited dynamically, and the manufacturer only accepts customization of customer command words, so the xfs5152ce is not selected as the speech recognition module. The ld3320 and the hbr740 have the same recognizable length per recognition, but the hbr740 supports more candidate recognition sentences each time, so the hbr740 is selected as the speech recognition chip [22].
The whole system takes Android as its platform and runs Android 5.0.2 on the FriendlyARM tiny4412. An SD card is used to burn the bootloader, Linux kernel, Android root partition image, Android system partition image, and Android data partition image. Since the tiny4412 has insufficient resources for software development, a cross-development mode is required when adding a dynamic link library to the tiny4412’s Android system. In cross-development mode, the software is edited and compiled on the host (PC) and then run and verified on the target board (embedded device). The target development board in this paper is the tiny4412 [23], and the host is a Linux system installed on a virtual machine. Figure 3 shows the cross-development mode established in this paper.

On the PC running Windows, the program is first edited in Source Insight and then transferred to the Linux virtual machine using FTP (File Transfer Protocol). On the Linux system, arm-linux-gcc is used to compile the program, and FTP is then used to transfer the compiled program back to the Windows system. On the Windows system, the ADB (Android Debug Bridge) tool is used to transfer the program to the tiny4412 and run it.
4. Hotel Human-Computer Interaction System Test
4.1. Face Detection Test
The face detection activity is opened to test the face detection function module. The expected behavior is that a green box is drawn whenever the camera detects a face. As shown in Figure 4, face detection is carried out on static images, and the face detection results of the system meet expectations. As shown in Figure 5, face detection is also performed on dynamic images. When the people in the frame move continuously, faces are still detected successfully, and the detection results meet expectations [24].
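The detector tested above relies on the AdaBoost algorithm in the OpenCV library. As a reminder of the boosting idea only, the following toy sketch trains AdaBoost with one-dimensional threshold classifiers (decision stumps). It is not OpenCV's implementation, which boosts classifiers over image features; the data set and the function names `train_adaboost` and `predict` are assumptions made for this illustration.

```python
import math


def train_adaboost(xs, ys, rounds=3):
    """Train AdaBoost with threshold stumps on 1-D features; labels are +1/-1."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []  # list of (alpha, threshold, polarity)
    thresholds = sorted(set(xs))
    for _ in range(rounds):
        best = None
        # Pick the stump with the lowest weighted error.
        for thr in thresholds:
            for polarity in (1, -1):
                preds = [polarity if x >= thr else -polarity for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, polarity, preds)
        err, thr, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, thr, polarity))
        # Re-weight the samples: misclassified samples gain weight.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, x):
    """Strong classifier: sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * (pol if x >= thr else -pol)
                for alpha, thr, pol in model)
    return 1 if score >= 0 else -1


# Hypothetical toy data that no single threshold can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = train_adaboost(xs, ys, rounds=3)
train_predictions = [predict(model, x) for x in xs]
```

No single stump classifies this toy set correctly, but the weighted vote of three stumps does, which is the property that lets a boosted detector combine many weak image features into a strong face/non-face decision.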


4.2. Voice Interaction Test
The voice interaction system designed in this paper is used in hotel service and faces a wide variety of hotel guests. When hotel guests interact with the service robot, they should be able to speak naturally, as they usually do. The system must therefore accommodate the speaking characteristics of different guests: as long as the meaning of a voice command is the same, the system should still capture the keywords and recognize the command even when the wording differs. For this reason, when editing the entry project of the local speech recognition module hbr740, recognition entries with different wordings of the same sentence are entered. For cloud grammar recognition, common everyday statements are simulated when constructing the grammar file, and both the keywords of command recognition and irrelevant words are written into it. This expands the speech recognition range of the system, but whether it affects the recognition performance is not known in advance. Experiments were therefore carried out to determine whether a certain recognition accuracy can still be ensured while maintaining natural interaction.
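The idea of capturing keywords while tolerating irrelevant words can be sketched as a simple keyword-group matcher. This is a simplified stand-in, not the hbr740 entry format or the iFLYTEK grammar file format; the command names, the keyword lists, and the function `recognize` are hypothetical, and the substring matching used here is deliberately naive.

```python
# Hypothetical command table: a command matches when every keyword group
# contributes at least one word found in the utterance.
COMMANDS = {
    "check_out": [("check out", "checkout", "leave")],
    "room_service": [("bring", "send", "need"), ("towel", "water")],
    "wifi_password": [("wifi", "wi-fi", "internet"), ("password",)],
}


def recognize(utterance):
    """Return the first command whose keyword groups are all present, else None."""
    text = utterance.lower()
    for command, groups in COMMANDS.items():
        if all(any(word in text for word in group) for group in groups):
            return command
    return None


first = recognize("Could you please bring me some towels?")
second = recognize("Hi, what is the WiFi password here?")
third = recognize("What's the weather like today?")
```

Differently worded requests map to the same command as long as the keywords survive, while filler such as "could you please" is ignored. An utterance with no matching keywords is rejected rather than forced onto a command.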
A total of 10 participants were selected [25]. Each participant spoke 15 relevant voice commands to test both the local voice recognition system based on the hbr740 voice recognition chip and the cloud voice recognition system based on cloud grammar recognition. The tests were conducted in an office environment. All participants’ instructions had the same meaning, but the wording followed each participant’s personal speaking characteristics. The average recognition rate of each instruction was calculated to see whether it meets the requirements for recognition accuracy. The test results are shown in Figure 6.
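The per-instruction averaging described above can be sketched as follows. With 10 participants and 15 commands there are 150 utterances per recognizer, so a single utterance changes the overall rate by about 0.67 percentage points. The outcome table below is hypothetical and is not the raw data of this experiment; the function name `recognition_rates` is likewise an assumption.

```python
def recognition_rates(results):
    """results[p][c] is True if participant p's command c was recognized.

    Returns (per-command rates in percent, overall rate in percent).
    """
    participants = len(results)
    commands = len(results[0])
    per_command = [
        100.0 * sum(row[c] for row in results) / participants
        for c in range(commands)
    ]
    overall = 100.0 * sum(sum(row) for row in results) / (participants * commands)
    return per_command, overall


# Hypothetical outcome table: 10 participants x 15 commands, all recognized
# except four illustrative misses.
results = [[True] * 15 for _ in range(10)]
for p, c in [(0, 3), (2, 3), (5, 7), (9, 14)]:
    results[p][c] = False

per_command, overall = recognition_rates(results)
```

Here `per_command` holds one average rate per instruction, as plotted per instruction in a chart such as Figure 6, and `overall` is the single average over all 150 utterances.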

Figure 6 shows that the average recognition accuracy of the local hbr740 is 81.33% and that of cloud grammar recognition is 92.67%, so the recognition rate of cloud grammar recognition is higher. This is because various nonkeywords are added during the construction of the cloud grammar; they can be freely combined and thus cover people’s everyday ways of speaking. Local speech recognition, in contrast, is limited by the number of entries: many statements are not fully included, so its recognition rate is relatively low. The advantage of local speech recognition with the hbr740 over cloud speech recognition is that it is not limited by network conditions. Overall, the cloud speech recognition rate of the system is relatively good, and the local speech recognition rate is also within an acceptable range.
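These complementary strengths motivate the hybrid scheme: use cloud recognition when the network is available and local recognition otherwise. One plausible switching policy is sketched below. It is an assumption about how such a fallback could be organized, not the exact logic of the implemented system, and the recognizer functions are hypothetical stand-ins for the iFLYTEK cloud service and the hbr740 chip.

```python
def hybrid_recognize(audio, network_available, cloud_recognizer, local_recognizer):
    """Prefer the cloud recognizer; fall back to the local one on failure."""
    if network_available:
        try:
            text = cloud_recognizer(audio)
            if text:
                return text, "cloud"
        except OSError:
            pass  # treat network errors as "cloud unavailable"
    return local_recognizer(audio), "local"


# Hypothetical stand-ins for the cloud service and the local chip.
def fake_cloud(audio):
    return "turn on the light"


def fake_cloud_down(audio):
    raise ConnectionError("network unreachable")


def fake_local(audio):
    return "light on"


online = hybrid_recognize(b"", True, fake_cloud, fake_local)
offline = hybrid_recognize(b"", False, fake_cloud, fake_local)
dropped = hybrid_recognize(b"", True, fake_cloud_down, fake_local)
```

Returning the source alongside the text lets the caller log which path handled each utterance, which is useful when comparing recognition rates with and without the network.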
5. Conclusions and Future Work
This paper introduces a hotel human-computer interaction system based on deep learning and tests and applies the interaction system. First, each module of the system is tested. Face detection is tested on static and dynamic images. Fifteen typical voice questions that are often needed in hotels are selected, and voice recognition is tested with the hbr740 voice recognition chip and with cloud grammar recognition, respectively; the grammar recognition built on iFLYTEK cloud grammar recognition is found to perform better. The answers to the 15 typical voice questions are then used to test the local voice synthesis module syn6288 and the cloud voice synthesis system. Both the local and the cloud voice synthesis work very well, but iFLYTEK’s cloud voice synthesis sounds better. The semantic understanding of iFLYTEK is also tested in the system, and questions within the selected skills are answered correctly. Second, according to the test results, a hybrid speech interaction system is built, and hybrid speech recognition with a network and local speech recognition without a network are tested. It is found that the speech recognition rate is improved.
In the future, more robust and effective techniques could be developed using more sophisticated deep learning models, including deep neural networks (DNNs), convolutional neural networks (CNNs), and graph convolutional networks (GCNs). Further integration of the attention mechanism, together with emerging edge and cloud computing technologies, is recommended to improve the algorithm’s performance in terms of training and prediction time. To increase prediction accuracy, appropriate aggregation techniques are suggested to reduce the amount of collected data by removing redundant data.
Data Availability
The corresponding author can provide the datasets used and analyzed during the current study upon reasonable request.
Conflicts of Interest
The author declares that there are no conflicts of interest.