Abstract

How the human brain performs recognition is still an open question. No physical or biological experiment can fully reveal this process. Psychological evidence is more about describing phenomena and laws than explaining the physiological processes behind them. The need for interpretability is well recognized. This paper proposes a new method for supervised pattern recognition based on the working pattern of implicit memory. An artificial neural network (ANN) is trained to simulate implicit memory. When an input vector is not in the training set, the ANN can treat the input as a “do not care” term. The ANN may output any value for a “do not care” term because the training process aims to use as few neurons as possible. The trained ANN can be expressed as a function and used to design a pattern recognition algorithm. Experiments on the Modified National Institute of Standards and Technology (MNIST) database show the efficiency of the pattern recognition method.

1. Introduction

Pattern recognition methods can be divided into two categories: two-stage and end-to-end. Most traditional pattern recognition methods are two-stage: feature extraction followed by pattern classification [1]. Feature extraction reduces the number of resources required to describe a large amount of raw data. The first step is to identify the measurable quantities that make the training sets distinct from each other. The measurements used for classification, such as the mean value and the standard deviation, are known as features. Several features are typically assembled into a feature vector. Some information is lost because feature extraction is not a lossless compression approach, and the lost information cannot be used for pattern recognition. Therefore, how to generate features is a fundamental issue. In feature vector selection, two crucial issues are the best number of features and the classifier design [2]. For example, feature selection plays an important role in text classification [3]. The complexity of real-world data makes two-stage methods arduous to apply.
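To make the two-stage pipeline concrete, the following minimal Python sketch extracts a hand-crafted feature vector (the mean and standard deviation mentioned above) and classifies with a nearest-centroid rule. The feature choice and the classifier are illustrative assumptions only, not the method proposed in this paper.

```python
import numpy as np

def extract_features(image):
    # First stage: compress a raw image into a small feature vector
    # (here, just the mean and the standard deviation of the pixels).
    pixels = np.asarray(image, dtype=float).ravel()
    return np.array([pixels.mean(), pixels.std()])

def train_centroids(images, labels):
    # Average the feature vectors of every class in the training set.
    feats = np.array([extract_features(im) for im in images])
    labels = np.array(labels)
    return {c: feats[labels == c].mean(axis=0) for c in set(labels.tolist())}

def classify(image, centroids):
    # Second stage: assign the class whose centroid is nearest in feature space.
    f = extract_features(image)
    return min(centroids, key=lambda c: np.linalg.norm(f - centroids[c]))
```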

Nowadays, deep neural networks can be trained end-to-end [4]. Raw data can preserve all the information of the pattern. Inspired by the biological neural networks that constitute animal brains, artificial neural networks (ANNs) are applied to image classification [5], speech separation [6], forest fire prediction [7], etc. However, a major drawback of neural networks is their black-box character. Explaining why a neural network makes a particular decision is difficult. The knowledge representation of a neural network is unreadable to humans [8]. The exact reason why trained deep neural networks can implement recognition remains an open question [9]. The training algorithm does not specify the way to recognize, and a causal model is difficult to build or acquire. End-to-end learning relies on data for the cognitive task [10, 11], and the data cannot tell the reasons [12].

In a supervised pattern recognition task, a set of training data (training set) is used to train a learning procedure. A training set is a set of instances that have been properly labeled by hand with the correct labels. The learning procedure attempts to recognize the instances as accurately as possible, and its goal is to minimize the error rate on a test set. The question arising in the recognition task is why a new data instance can be classified as a particular category. The problem of supervised pattern recognition can be stated as follows. Assume that each training set X_i only contains instances with label i, and that X_i and X_j are disjoint for any i ≠ j. Given the training sets, the question is how to label a new instance y.

A memory system is involved in the process of recognition. Jacoby and Kelly posit that memory can serve as both storage and a tool [13]. Memory is treated as storage during recall; in this case, the focus is on the past, and memory is used like computer storage. Meanwhile, memory (from experience) can be used as a tool to perceive and interpret present events.

Implicit memory is acquired and used unconsciously and can serve as a tool/function [14–16]. In one experiment, two groups of people are asked several times to solve a Tower of Hanoi puzzle. One group consists of amnesic patients with heavily impaired long-term memory, and the other is composed of healthy subjects. The first group shows the same improvements over time as the second one, even though some participants claim that they do not even remember seeing the puzzle before. These findings strongly suggest that procedural memory is completely independent of declarative memory [17]. Given a game state, implicit memory is trained to output an operation. Memorizing the solving steps is not necessary. When the state appears again, the input can evoke the trained operator [18].

Usually, humans implement recognition processes unconsciously, and the focus is on the currently given images, so memory works as a tool while the recognition process is carried out. The recognition process depends on a similarity comparison between the current input and the labeled instances: the more similar they are, the more likely they are to be in the same class and share the same label. However, how to compare similarities without memorizing any labeled instances is not evident.

This paper proposes an implicit memory-based method for supervised pattern recognition. The method does not memorize or recall any labeled instances and falls in neither the two-stage nor the end-to-end category. The proposed method has interpretability since similarity criteria are used in the process of recognition: a new instance is recognized as a particular class because the instance appears similar enough to the training data of that class. Compared with the k-nearest neighbors algorithm [19], the proposed method does not need to recall and iterate through the training sets. The process is consistent with the human ability of pattern recognition: people may forget most of the training instances, but they can still recognize a new instance. The Modified National Institute of Standards and Technology (MNIST) database (general site for the MNIST database: http://yann.lecun.com/exdb/mnist) is used to verify the proposed method.

The rest of this paper is organized as follows. First, a model is built to describe implicit memory. Second, with the implicit memory model, a recognition algorithm is proposed. Then an application and the analysis of experimental results are given.

Notation: |A| expresses the cardinality of the set A; {0, 1}^n is an n-dimensional space constructed by 0 and 1; d is a metric on {0, 1}^n; define a distance function between any point x of {0, 1}^n and any nonempty set A of {0, 1}^n by d(x, A) = inf{d(x, a) : a ∈ A}.

Given two n-dimensional binary vectors u and v, the element-wise product of u and v, written u ⊙ v, is the vector with elements given by (u ⊙ v)_k = u_k v_k; ‖u‖ expresses the number of 1s in the binary vector u; the probability of an event E is written as P(E).

The expression of an inverter (NOT circuit) can be written as follows: if the input variable is called A and the output variable is called X, then X = Ā; if A = 1, then X = 0, and if A = 0, then X = 1. The expression of a 2-input AND gate can be written in equation form as X = AB; the output of an AND gate is 1 only when both inputs are 1s. The expression of a 2-input OR gate can be written as X = A + B; the output of an OR gate is 1 when any one or more of the inputs are 1s.
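As a quick check of these three gate equations, the short Python snippet below evaluates them on binary inputs (0 and 1); it is a plain restatement of the definitions above, not part of the proposed method.

```python
# NOT: X = 1 - A; AND: X = A*B; OR: X = 1 when at least one input is 1.
def gate_not(a):
    return 1 - a

def gate_and(a, b):
    return a & b

def gate_or(a, b):
    return a | b

assert gate_not(0) == 1 and gate_not(1) == 0
assert gate_and(1, 1) == 1 and gate_and(1, 0) == 0
assert gate_or(0, 0) == 0 and gate_or(0, 1) == 1
```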

2. Model of Implicit Memory

In this section, a model is built to describe implicit memory. From one given input, implicit memory can give an output. Both the input and the output of implicit memory are actual signals. The signals have explicit physical meaning in the real world, such as sound, light, or electricity. The types of the input and output signals can be the same or different. Meanwhile, the input and output signals can be measured and represented as binary vectors.

The action of implicit memory can be represented as a function f. The domain of f is a set of binary vectors, and an element of the domain represents an input signal of implicit memory. The codomain of f is also a set of binary vectors, and an element of the codomain represents an output signal of implicit memory. The set of all ordered pairs (x, f(x)) represents the training result of implicit memory, where x is an element of the domain and f(x) is an element of the codomain. An example of the function is given with a small domain and codomain of binary vectors; Figure 1 shows these mapping relationships.

A computer can store all the ordered pairs in a database. Given an input, the computer can search the database for the output. With the database, the computer can simulate the external behavior of implicit memory. However, this method needs to recall the ordered pairs stored in the database, and this internal process is different from the process of implicit memory. Implicit memory does not need to perform memorization or search. Implicit memory is similar in operation to a high-speed logic circuit: given an input, it produces an output without a second of hesitation.

A logic circuit and a database have different implementations. In the preparation phase, the database stores the input-output pairs on a hard disk, whereas the logic circuit connects logic gates to implement the input-output map. At run time, the database searches the hard disk for the given input to find an output; in the logic circuit, the input instead propagates through the logic gates to produce the output. The complexity of real-world data also makes it arduous to design logic circuits manually. However, an ANN can simulate the actions of implicit memory automatically.

The capacities of implicit memory can be represented as a set of ordered pairs of binary vectors. Without loss of generality, assume that all input vectors have the same length and all output vectors have the same length. The operation of an implicit memory can then be expressed with a table that lists all allowed input vectors with the corresponding outputs, as illustrated by the example in Table 1. The table shows the output for each allowed input. Only the implementation of the first bit of the output signal is considered, since the other bits can be implemented in the same way. The table can serve as a truth table.

Truth tables are widely used to describe the operation of logic circuits. From a truth table, a sum-of-products (SOP) expression can be written. The Boolean SOP expression is obtained from the truth table by ORing the product terms for which the output is 1:

Each term in the expression is formed by ANDing the three input variables, complementing those whose value is 0 in the corresponding row. Logic gates can be used to implement the expression: inverters form the complemented variables, two 3-input AND gates form the product terms, and one 2-input OR gate forms the final output function. Figure 2 shows the logic diagram. Therefore, a logic circuit can simulate the actions of the implicit memory.

2.1. ANN-Based Boolean Operation

The NAND gate is a universal gate because it can be used to produce the NOT, AND, OR, and NOR functions [20]. With NAND gates and appropriate connections, all logic circuits can be built.

Suppose there is a sigmoid neuron with two inputs, x_1 and x_2. The sigmoid neuron has a weight for each input, w_1 and w_2, and an overall bias, b. The output of the sigmoid neuron is σ(w_1 x_1 + w_2 x_2 + b), where σ is called the sigmoid function and is defined by σ(z) = 1/(1 + e^(−z)).

With suitable negative weights and a positive bias, the sigmoid neuron is shown in Figure 3. The input 00 then produces output 1, since the weighted sum plus the bias is positive. Similar calculations show that the inputs 01 and 10 also produce output 1. But the input 11 produces output 0, since the weighted sum plus the bias is negative. Therefore, the sigmoid neuron can implement a 2-input NAND gate.

Let the value of one weight be 0, so that the output is not affected by the corresponding input and the sigmoid neuron is equivalent to a 1-input neuron. With a suitable negative weight and positive bias, the sigmoid neuron is shown in Figure 4. Input 0 then produces output 1, and input 1 produces output 0. The sigmoid neuron can implement a 1-input NOT gate.

With a suitable positive weight and negative bias, the sigmoid neuron is shown in Figure 5. Input 0 then produces output 0, and input 1 produces output 1. The sigmoid neuron can implement a connecting line.
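A minimal numerical sketch of these three configurations is given below. The weight and bias values are illustrative assumptions (the exact values used in Figures 3–5 are not reproduced here); they are chosen with large magnitudes so that the sigmoid output saturates and rounds to the intended binary value.

```python
import math

def sigmoid_neuron(inputs, weights, bias):
    # Output of a sigmoid neuron: sigma(w . x + b).
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Assumed weights/biases; large magnitudes drive the output close to 0 or 1.
nand = lambda x1, x2: sigmoid_neuron((x1, x2), (-10, -10), 15)  # 2-input NAND
inv  = lambda x: sigmoid_neuron((x,), (-10,), 5)                # 1-input NOT
line = lambda x: sigmoid_neuron((x,), (10,), -5)                # connecting line

for a in (0, 1):
    for b in (0, 1):
        assert round(nand(a, b)) == 1 - (a & b)
    assert round(inv(a)) == 1 - a
    assert round(line(a)) == a
```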

2.2. Simulation of Implicit Memory

According to De Morgan’s laws, a Boolean expression can be resolved into 2-input NAND and 1-input NOT operations. For example,

A neural network of five layers can implement expression (4), as shown in Figure 6. This method of constructing the network is general: an ANN can be configured to execute an arbitrary mapping. Therefore, an ANN can simulate the actions of the implicit memory.

Sometimes a situation arises in which some input variable combinations are not allowed. Because these unallowed states never occur in an application, they can be treated as “do not care” terms, for which either a 1 or a 0 may be assigned to the output. The “do not care” terms can be used to simplify an expression. Table 2 shows that, for each “do not care” term, an X is placed in the output. As indicated in Figure 7, when grouping the 1s on the Karnaugh map, the Xs can be treated as 1s to make a larger grouping or as 0s if they offer no advantage. The larger the group, the simpler the resulting term [20]. With the “do not care” terms, the five-layer neural network can be simplified to a connecting line between the input and the output.

Using the ordered-pair set as a training set, an ANN can be trained to simulate the implicit memory. The optimization process of training the ANN has properties similar to a Boolean expression simplification process: the purpose of both is to implement the required functionality with minimal resources, such as logic gates and activated neurons. The brain might have less than 1% of its neurons active at any given time [21–23]. If an input vector is not included in the training set, then the ANN can treat the input as a “do not care” term and may output either a 1 or a 0.

Suppose a supervised learning algorithm trains an ANN. When the given input is in the training set, the output of the trained ANN can be expected; otherwise, the output of the trained ANN cannot be expected, and only by actually measuring it can the output be known. The measurement process is like sampling from a statistical population. The trained ANN model can be expressed as a function f with the following properties:
(i) The output of the function is a specified value when the input is in the training set; that is, if the input is in the training set, the function returns the trained output.
(ii) Otherwise, the output is assigned by generating a sample from a statistical population.
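The sketch below mimics these two properties with a lookup table plus a random fallback. It only reproduces the external behavior described above; the actual ANN realizes the same behavior without storing or searching the ordered pairs. All names are illustrative.

```python
import random

def make_f(training_pairs, output_length, seed=None):
    # Build a function with the two properties above.
    table = {tuple(x): tuple(y) for x, y in training_pairs}
    rng = random.Random(seed)

    def f(x):
        x = tuple(x)
        if x in table:
            # Property (i): an input from the training set gets its trained output.
            return table[x]
        # Property (ii): a "do not care" input gets a sample from a random
        # binary population (0 and 1 with equal probability).
        return tuple(rng.randint(0, 1) for _ in range(output_length))

    return f

f = make_f([((0, 0, 1), (1,)), ((1, 0, 1), (0,))], output_length=1)
print(f((0, 0, 1)))  # always (1,)
print(f((1, 1, 1)))  # not in the training set: (0,) or (1,) at random
```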

The following section proposes a pattern recognition algorithm built on the above function f.

3. Recognition Based on the Implicit Memory Model

The principle of recognition is based on the intuitive assumption that examples in the same class are closer/similar [24, 25]. Therefore, a recognition algorithm is proposed to estimate the similarity via the implicit memory model.

Let us denote a signal by x, where x ∈ {0, 1}^n. When a function can precisely predict any masked part of the signal x, distinguishing between the signal x and any other signal that is not the same as x is feasible. The following theorem describes how to recognize the signal x.

Theorem 1. Consider that the signal x is a constant. For an arbitrary mask, a function satisfies the stated prediction condition. Construct a function of the stated form. If an input is not included in the codomain of the function, then the output of the function is assigned by generating a signal from a random number generator. The generator can produce the binary digits 0 and 1 through equal-probability sampling. Let y be a new given signal; the equations below describe how to recognize the signal x in the two cases.

The proof of Theorem 1 is given in the Appendix.

According to equations (7)–(9), recognizing the signal x with the function is feasible. When a point can help the function retrieve the signal x, the point is called an evoked point. Let

There are evoked points, where . If , then the expected value of is .

The function can retrieve the signal from the evoked point, , in the current input signal since the signal can be expressed as . By comparing with , identifying whether is the same as or not is feasible. If there exists a mask that can make , then . If , then each mask can make .

In a way similar to recognizing the signal x (Theorem 1), identifying whether a new given signal is in a set or not is also feasible.

Lemma 1. Suppose . If the signals, , satisfy for any , then there exists a positive integer, , such that for any .

The proof of Lemma 1 is given in the Appendix.

Theorem 2. Suppose the signals satisfy the stated pairwise condition, where the bound is a constant. Without loss of generality, assume the stated ordering holds. Define the quantities as stated; then the cardinality of the relevant set satisfies the stated bound in the first case, and the stated conclusion holds in the second case.

The proof of Theorem 2 is given in the Appendix.

Assume that any signal in set does not equal , i.e., for each . Then, where . If , then the expected value of is .

When a new signal appears, it is possible to identify whether it is in the set or not with a function . Construct the function of the form, for any , . If an input is not included, then the output of the function is assigned by generating a signal from a random number generator. If there exists a mask that can make , then . If (for example ), then there exists at least one mask that can make since .

To identify whether a new given signal has the same label as , the distance/similarity between and can be used, such as

Without memorizing any element in , it is also possible to estimate the similarity by using the function .

Let

If , then there exist evoked points. Let . Each evoked point corresponds to a mask, where . The evoked points and the masks can be used to retrieve . With the function, the retrieved signal can be obtained by

The similarity between and can be estimated by

The process of similarity estimation is influenced by . If has the same meaning as element in , then has to overcome the influence. The drawback of this estimation is that the algorithm might not traverse all elements in . If , then there are no evoked points that can help us to retrieve . However, the advantage is that intentionally recollecting all elements in is not necessary.

By training functions to predict the masked parts of elements in their respective instance sets, i.e., , recognizing a new given signal as one category is feasible. Let , , and, where , . The function satisfies, for any , . If an input is not included, then the output of the function is assigned by generating a signal from a random number generator.

When a signal is to be labeled, the similarity rule is a natural choice. To reduce the influence on the analysis of similarity, the infimum is replaced with an average. The similarity between the signal and an instance set can be estimated by

The smaller this estimated distance is, the more similar the signals are, and the more likely they are to be in the same category and share the same label. The above recognition model is presented in Algorithm 1, which is called the Implicit Recognition Model; a Python sketch of its cognitive process is given after the listing.

input: is a training instance set and is the known label of ; , labeled , is an approximator and can be trained; ; is a testing instance.
          Cognitive Process
(1)for each do
(2)  repeat
(3)   Observing a signal in ;
(4)   Training the prediction function to predict the masked parts of the signal;
(5)  until For each signal and any , .
(6)end for
(7)return Prediction functions labeled with respectively.
          Recognition Process
(1)for each do
(2)  Estimating the similarity between testing signal and instance set ;
(3)  Let ;
(4)end for
(5) Assigning the image to the class with the highest similarity;
(6) ;
(7)return.
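The following Python skeleton mirrors the cognitive process of Algorithm 1 under stated assumptions: make_approximator, train_step, predict, random_mask, distance, and epsilon are hypothetical placeholders for the trainable prediction function, its training and inference routines, the mask sampler, the distance measure, and the stopping threshold.

```python
def cognitive_process(instance_sets, make_approximator, random_mask,
                      distance, epsilon, max_rounds=1000, check_masks=10):
    # Train one prediction function per labeled instance set (Algorithm 1,
    # cognitive process): each function learns to fill in masked parts of the
    # signals in its own set until the reconstruction error is below epsilon.
    predictors = {}
    for label, signals in instance_sets.items():
        f = make_approximator()                 # untrained approximator for this label
        for _ in range(max_rounds):
            for x in signals:
                f.train_step(x, random_mask())  # learn to predict a masked part of x
            done = all(distance(f.predict(x, random_mask()), x) < epsilon
                       for x in signals for _ in range(check_masks))
            if done:
                break
        predictors[label] = f                   # prediction function labeled with `label`
    return predictors
```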

In summary, the Implicit Recognition Model uses similarity comparison to perform recognition, but intentionally recalling any labeled signals is not necessary. The focus is not on the past but on the current input. The evoked points are objective features that help retrieve labeled signals from the current input. Both the evoked points and the retrieved signals have explicit physical meaning in the real world, and they are objective pieces of evidence that support the judgment.

4. Experiment

In the application, the Modified National Institute of Standards and Technology (MNIST) database is used to verify the Implicit Recognition Model. The MNIST database is one of the most famous image classification benchmarks [26]. There are 60,000 instances in the training set and 10,000 instances in the testing set. The database was created by remixing the samples of NIST’s original datasets. The black-and-white images from NIST are size-normalized to fit into a 20 × 20 pixel bounding box and antialiased, which introduces gray-scale levels [27]. The first 500 elements of the training set are used for training, and all of the testing instances are used for testing.

Each set X_i is labeled i and only contains training images with label i, where i ∈ {0, 1, …, 9}. The first 500 training images are partitioned by label: some are labeled 0, some are labeled 1, and so on. All 500 images are different, and the sets X_i are pairwise disjoint.

In the process of recognition, an image in the testing set is recognized as the class with the highest similarity. To execute Algorithm 1, a distance function is redefined first. Cosine distance is a usual measure of similarity in machine learning [28, 29]. Given two vectors u and v, the cosine similarity is represented by cos θ = (Σ_k u_k v_k) / (√(Σ_k u_k²) · √(Σ_k v_k²)), where u_k and v_k are the components of the vectors u and v, respectively. The angle θ = arccos(cos θ) is used as the distance between two images in this application.
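A small NumPy helper for this angle distance might look as follows (a sketch; the function and variable names are illustrative):

```python
import numpy as np

def angle_distance(u, v):
    # Distance between two images: the angle (in radians) between their pixel
    # vectors, i.e., the arccosine of the cosine similarity defined above.
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos_sim, -1.0, 1.0)))
```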

With the artificial neural network, the implicit memory is simulated to execute the Implicit Recognition Model (Algorithm 1). The main process of Algorithm 1 is to train 10 approximators of the prediction functions. Each approximator can then be used to retrieve images of its class from the evoked points of the current input image and to estimate the similarity between the input and the corresponding instance set.

To imitate the human recognition process, the 10 ANNs are trained in two steps. The first step is to train a base neural network to complete all training images by filling in missing regions of rectangular shape; this step does not use labels. The second step, which finally generates each specific prediction function, is to add routing layers to the trained base neural network. With supervision, the routing layers are trained on the corresponding instance set. The routing neural networks are used to approximate the specific prediction functions.

4.1. Architecture of ANN

A fully convolutional neural network is modified to be the base neural network; refer to [30] for more details of the original fully convolutional neural network. The modified network parameters are given in Tables 3–6. Behind each convolution layer, except the last one, there is a Rectified Linear Unit (ReLU) layer. The output layer consists of a convolutional layer with a sigmoid function instead of a ReLU layer to normalize the output to the [0, 1] range. “Outputs” refers to the number of output channels of each layer (Tables 3–6). Fully connected (FC) layers refer to standard neural network layers. The output layer of the discriminator consists of a fully connected layer with a sigmoid transfer layer. The discriminator outputs the probability that an input image came from real images rather than from the completion network.

Routing layers are added to the trained base neural network to generate an approximator of each specific prediction function. Figure 8 shows the architecture of a routing neural network. Three routing layers are inserted in front of network layers. A routing layer performs element-wise multiplication between its input and the routing weights. The routing weights are initialized to a fixed value. The role of the routing layers is to keep the prediction similar to the target on the corresponding instance set, while the distance grows with training for instances of the other classes.
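A routing layer of this kind could be expressed as the following PyTorch module (a sketch only; the paper’s implementation uses TensorFlow 1.8, and the initialization to ones is an assumption that makes the routing network match the base network before training):

```python
import torch
import torch.nn as nn

class RoutingLayer(nn.Module):
    # Element-wise multiplication of the layer input with trainable routing weights.
    def __init__(self, shape):
        super().__init__()
        self.w = nn.Parameter(torch.ones(shape))  # assumed init: acts as identity

    def forward(self, x):
        return x * self.w                         # element-wise product

# Example: routing a 64-channel, 28x28 feature map.
layer = RoutingLayer((64, 28, 28))
features = torch.randn(1, 64, 28, 28)
routed = layer(features)  # identical to `features` at initialization
```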

4.2. Training Method of ANN

While the base neural network is trained, global and local context discriminators are trained to distinguish real images from completed ones. The base neural network is trained to complete images by filling in masked regions of rectangular shape. Training on the full mask set would be too large to calculate, so a subset of masks is selected, with a rectangular masked part of 7 to 14 pixels for the width and height.
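Such masks might be sampled as in the sketch below (a hypothetical helper; the convention that 1 marks a masked pixel is an assumption):

```python
import numpy as np

def random_rect_mask(height=28, width=28, low=7, high=14, rng=None):
    # Binary mask with a rectangular masked region whose width and height
    # are drawn uniformly from 7 to 14 pixels (1 = masked, 0 = kept).
    rng = rng or np.random.default_rng()
    h = int(rng.integers(low, high + 1))
    w = int(rng.integers(low, high + 1))
    top = int(rng.integers(0, height - h + 1))
    left = int(rng.integers(0, width - w + 1))
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[top:top + h, left:left + w] = 1
    return mask
```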

All 500 training images are used to train the base neural network.

When the training of the base neural network is finished, routing layers are added to generate an approximator of each specific prediction function. While the routing layers are trained, the parameters of the base neural network remain unchanged. The ten digit labels/categories correspond to 10 approximators of prediction functions, and each prediction function has a unique set of routing parameters.

The training method for the routing layers is the same as for the base neural network: the routing neural network is also trained to complete images by filling in masked regions of rectangular shape. However, only the corresponding instance set is used to train the routing parameters of each approximator. The three routing layers are trained one after another; while one routing layer is trained, the other two remain unchanged. When the reconstruction distance becomes small enough for each signal in the instance set and any selected mask, the cognitive process finishes. The trained routing neural network is an approximator of the specific prediction tool/function.

4.3. Experimental Results

The neural network models are created with TensorFlow 1.8.0. The Adam optimizer is used with a fixed learning rate, and a batch size of one image is used for training. First, the base neural network is trained for 100 iterations; then the discriminator and the base neural network are trained jointly. In each iteration, all 500 training images are shuffled and traversed, and for each image a masked rectangular region is randomly selected to train the neural networks. After 1000 iterations, the base neural network can predict the masked part (Figure 9).

Then the base neural network is kept unchanged, and the routing layers are added. The first routing layer is trained for 260 iterations and then remains unchanged; the second routing layer is likewise trained for 260 iterations and then remains unchanged. The approximation of a specific prediction function is finished when the third routing layer has also been trained for 260 iterations.

A set of routing parameters corresponds to a specific prediction function, and each approximator corresponds to a routing neural network with a particular set of routing parameters. The approximator of a prediction function can only work on its own training set; Figure 10 gives an example. The mean reconstruction distance is small for each image in the corresponding set (Figure 11).

With the approximator of the prediction function, the algorithm can estimate the similarity between a testing image and each instance set. A subset of the masks is used to estimate the similarity, and equations (23) and (25) are modified accordingly:

Thirty masks are randomly generated. The masks remain unchanged while the algorithm calculates the similarities. Finally, the image is recognized as the digit whose instance set yields the smallest estimated distance.
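The recognition process can be sketched as follows; predictors is assumed to map each digit to a callable that completes a masked image, and angle_distance is the helper sketched earlier.

```python
import numpy as np

def recognize(test_image, predictors, masks, angle_distance):
    # For each digit's prediction function, mask the test image, let the
    # function fill in the masked part, and average the angle distance between
    # the completion and the original image over all masks. The image is
    # assigned the digit with the smallest average distance.
    scores = {}
    for digit, predict in predictors.items():
        dists = [angle_distance(predict(test_image, m), test_image) for m in masks]
        scores[digit] = float(np.mean(dists))
    return min(scores, key=scores.get), scores
```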

The masked part is the key to the similarity comparison. If the mask is too small, the completion and the original image are mostly the same, and there is not enough data to demonstrate the similarity; the main cause of high similarity should be the accuracy of the prediction, not the small mask window. An appropriate size makes the estimation results more comparable and credible. For a similarity comparison, the mask’s window size cannot be too small; meanwhile, the size cannot be too large, or the ANN becomes hard to train. The bigger the size, the harder the masked parts are to predict.

Using the Implicit Recognition Model (Algorithm 1), the correct recognition probability approaches 100% on the training set (which contains 500 training images). “Digit 4” and “digit 9” are easily confused. On the testing set, 80.44% of the recognition results of the Implicit Recognition Model match the labels, and the confusion matrix is shown in Figure 12. The experimental results show the efficiency of the proposed Implicit Recognition Model.

The proposed model recognizes an image as a digit because the image is most similar to the instances in that digit’s set. As the similarity becomes lower, the recognition precision of both the Implicit Recognition Model and the one-nearest-neighbor algorithm (Explicit Recognition Model) decreases while the testing images are rotated (Figure 13). Even when the testing image itself is used for the rotation similarity comparison, the distance still increases (Figure 14), where the compared image is the testing image rotated by the given number of degrees. When the rotation angle is around 90 or 270 degrees, the distance approaches 1.22 radians, which is the mean distance between a noise image and the testing image. The pixels of a testing image take integer values between 0 and 255, and the pixels of a noise image obey a discrete uniform distribution over the same range. In Figures 13 and 14, the curves show a W-shape and an M-shape, because images labeled 1 are similar to their rotated versions when the rotation angle is around 180 degrees. Rotating an image changes the similarity; therefore, a rotated image can also be used to define a new category if the similarity change is big enough. For example, the main difference between “W” and “M” is the direction of the opening (Figures 13 and 14).

Furthermore, the proposed algorithm is robust when random noise is added to the testing images. Suppose the noise added to each pixel is an independent random variable. Each noise point obeys a discrete uniform distribution over a range of integers, and the bounds of this range restrict the values and represent the noise level: the bigger the bounds, the higher the noise level. The testing images are disturbed by noise (Figure 15) to test the robustness of the algorithms. The noisy image is obtained by adding the noise to the original image and restricting the result to the valid pixel range.
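A sketch of such a noise injection is given below; the exact parameterization of the noise level in the paper is not reproduced, so level here is simply the half-width of a symmetric integer range (an assumption):

```python
import numpy as np

def add_uniform_noise(image, level, rng=None):
    # Add discrete uniform noise in [-level, level] to every pixel and clip
    # the result back to the valid pixel range [0, 255].
    rng = rng or np.random.default_rng()
    noise = rng.integers(-level, level + 1, size=np.shape(image))
    noisy = np.asarray(image, dtype=np.int64) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)
```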

Using PyTorch 1.5.1, traditional neural network models are also modified to do the recognition. Two linear layers are added after the output layer, and in front of each linear layer there is a Rectified Linear Unit (ReLU) layer. The testing images are resized and repeated along the channel dimension. Then the stochastic gradient descent optimizer and a batch size of 32 images are used for training.

After training for 1000 iterations, the traditional neural network models can work when the noise level is low. However, the performance is not stable as the noise level increases: the recognition precision of many traditional neural networks drops sharply even when the change in the noise level is small, as shown in Figure 16. The performance curves of AlexNet and VGG, however, are smooth, as shown in Figure 17. Overfitting cannot provide the reason behind the phenomenon, since the neural networks are black boxes. Humans can recognize the digits even at a high noise level; therefore, the smoother the performance curve, the closer it is to human capability. At a high noise level, the recognition accuracy of the traditional neural networks is less than 20%, but the accuracy of the Implicit Recognition Model is about 50%. The Implicit Recognition Model improves the robustness significantly.

5. Conclusion

Scientists have long known of the existence of implicit memory. This paper establishes a mathematical model of implicit memory and explains, from the viewpoint of digital logic circuit design, how an ANN can simulate the model. A trained ANN can be expressed as a function; when a given input is not in the training set, the output of the ANN is hard to control. With this function, the paper proposes a new pattern recognition method, the Implicit Recognition Model. The Implicit Recognition Model works under the similarity rule and has interpretability. Compared with the one-nearest-neighbor algorithm (Explicit Recognition Model), the Implicit Recognition Model makes similarity comparisons without recalling any instances. The experimental results show the efficiency of the Implicit Recognition Model.

Appendix

A. Proof of Theorem 1

Proof. Construct a function of the form . Let . When , . If , then . Therefore, .
Let and . For an arbitrary mask , belongs to either or . If , then . If , then . If , then . Because ,

B. Proof of Lemma 1

Proof. Mathematical induction is used to prove this lemma. When , there exists a positive integer, , such that , since .
Suppose the lemma is true when . Let for any .
If for any , then can be any positive integer where . If , then there exists a positive integer, , such that , since .

C. Proof of Theorem 2

Proof. Let , where since for any . If , then . If , then since .

Data Availability

The handwritten digits data supporting this study are from previously reported studies and datasets, which have been cited. The processed data are available in the MNIST database.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.