Abstract
Collecting fully labeled data is a challenging problem for learning classifiers. Nowadays, the general tendency in developing models is to make them larger in order to obtain more capacity for effectively predicting unknown instances. However, imbalanced datasets still cannot meet the needs of training a robust classifier. A convincing way to extract invariant features from images is to train on augmented input datasets. However, selecting a proper way to generate synthetic samples from the large number of feasible augmentation methods is still a big challenge. In this paper, we use three types of datasets and investigate the merits and demerits of five image transformation methods: color manipulation methods (color and contrast) and traditional affine transformations (shift, rotation, and flip). We find a common experimental result: plausible color transformation methods perform worse than traditional affine transformations at reducing overfitting and improving classification accuracy.
1. Introduction
Crowdsourcing is the most viable solution for collecting a large number of high-quality labeled samples, but it is certainly a very time-consuming process, and hiring a skilled workforce to annotate millions of data samples is expensive. In some industries, e.g., healthcare, raw medical data are sensitive and carry significant privacy implications; to protect patients against breaches of medical records, these resources are normally veiled by strict secrecy. Similarly, some data samples must be gathered and labeled by people with professional knowledge (e.g., in the aerospace industry); these factors increase the cost of collecting quality labeled instances. The lack of labeled samples is a serious issue in the deep learning research field, where it is common knowledge that the more training data you can access, the better the generalization quality the neural network model can learn.
Another approach is to extract more possible feature values from limited data in order to improve accuracy and resilience. In real-world situations, a small amount of fictitious data can be created by mixing characteristics of existing samples. By introducing geometric [1, 2] or colorimetric distortion [1, 3] into the data space in the computer vision domain, virtual pictures that replicate extrinsic characteristics may be generated. Because of the picture perturbation effect, the learning process must adjust its criteria to account for the noise in the input samples and extract a common object structure to characterize them.
Learning a classifier from synthetic data remains a difficult problem, even with the assistance of data augmentation. When a neural network is trained, the background process may change hundreds of weight parameters at once, and there is no credible theory to determine how individual weight parameters change or how much data is adequate to represent a reliable model. Nonetheless, a good network model should be adaptable to morphological transformations (e.g., flip, shift, rotation) and less susceptible to external disruptions such as ambient light or particular photo equipment. Utilizing synthetic data, [4] proposes a sparse autoencoder algorithm, the Multichannel Autoencoder, to bridge the synthetic gap between generated images and real data, imitating true-labeled samples in terms of object structure and appearance; other analogous algorithms such as GAN [5] and DBSMOTE [6] make synthetic instances closer to real ones and improve precision. By contrast, improper image generation methods may break the density distribution in the data space, scattering samples far from the centroid and weakening the model's ability to find invariant object structures. To address these problems, [7] explores a transformation selection system, transformation pursuit; this algorithm uses a greedy strategy to randomly search a large parameter space of candidate transformations and find the optimized hyper-parameters for enlarging particular data samples. Others, such as [8], do the same thing to select values that maximize the entropy loss between two training epochs. All these methods face the same problem: the algorithms must search enormous random parameter spaces for different transformations, which wastes a lot of time without prior knowledge.
In this article, we first compare five augmentation methods; for ease of comparing their merits and defects, we separate them into two groups (traditional affine transformations and color transformations) according to the human visual effect when those random perturbations are applied to natural images. We then put forward our prior finding about these augmentation methods: affine transformation is much more capable than color augmentation of reducing the overfitting phenomenon and improving network accuracy. Secondly, to support our point and facilitate the study of varied real-world images, we collect a Pests dataset, which has clear backgrounds and easily identifiable object shapes. Along with our dataset and two other standard image processing datasets (VOC, ImageNet), we get the same result when training three different deep learning architectures (AlexNet [9], VGG16 [10], and InceptionV3 [11]): precision significantly improves from 74.5% to 85.2% with an appropriate transformation on the Pests dataset, and overfitting is strongly reduced on the ImageNet and VOC datasets.
2. Related Work
A common difficulty when developing a model is that the neural network architecture performs far better on training datasets than on test datasets. An inappropriate underlying model structure or insufficient training data may be the cause. In the field of deep learning, a simple technique to overcome the overfitting issue is to include a regularization term in the loss function, which penalizes peaky weight parameters and encourages diffuse weight distributions. The sparsity of weights aids the model in determining which input variables are relevant. Typically, at the start of training, the weight values are initialized from a Gaussian distribution, and a particular weight-normalization layer [12] is used to favor diffuse weight vectors. Another widely used technique is dropout [13], which involves randomly "deleting" some connection nodes, so the descriptor trains with a different network architecture in each mini-batch; this reduces training time and improves generalization capacity by loosening the coupling between network nodes. Furthermore, transfer learning supports the idea of applying pre-trained weights and fine-tuning a subset of model parameters to fit particular characteristics. The benefit of transfer learning is that the fine-tuned model can be adapted very accurately with fewer task-specific input samples. Data augmentation is useful for decreasing overfitting during dataset preparation because fresh synthetic examples can be created indefinitely, each from a slightly perturbed sample, and the classifier has to adapt to them in order to acquire additional latent features.
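As a small illustration of the dropout idea described above (a sketch of our own, not the configuration used in this paper), a classifier head can randomly drop activations during each mini-batch; the layer widths and the 0.5 dropout rate below are illustrative assumptions.

```python
# Illustrative sketch only: a small classifier with dropout regularization.
# The layer sizes and the 0.5 dropout rate are assumptions, not the paper's settings.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Flatten(input_shape=(224, 224, 3)),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),          # randomly zeroes half of the activations in each mini-batch
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
```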
Previous research has focused mostly on the mechanism and sequence of picture generation. For example, [14] classified augmentation methods as off-line or on-line deformation, with fresh augmented images sampled and discarded after each iteration until the network converged. Some standard computer vision datasets (e.g., MNIST) have been widely used to judge a model's ability and explore the differences between human recognition and computer vision algorithms. Commonly, plausible affine transformations (such as scaling and rotations) were applied in [2]; these diverse affine transformation parameters were meant to simulate uncontrolled oscillation of the hand muscles. Different classical descriptors and a series of experimental results are evaluated in [15], which compared the strong and weak points of the SMOTE [16], ELASTIC, and DBSMOTE [6] transformation algorithms. On a large visual database, ImageNet, [10, 11] randomly crop rescaled input images and generate sample patches from 256 × 256 down to 224 × 224 to combat overfitting; some approaches apply random color manipulation [3, 9–11, 17] when synthesizing augmented images in order to make the model less sensitive to illumination changes, which are usually caused by real-world light sources, photographic apparatus, or adjustment parameters at shooting time.
Despite this, the majority of practitioners choose random parameters to expand their training samples. When creating an image via a geometric or color-casting transformation, the effective down-sampling approach should not disrupt the details and should have label-preserving properties. The algorithms in [7, 8] are based on greedy search and trust-region optimization, respectively; they automate the search for optimal transformation parameters and maximize the loss value of robust classifiers. However, this processing takes time to find and backtrack appropriate values. Some experts have offered legitimate augmentation strategies based on plausible prior information in order to reduce the search regions. For example, [18] presents a transformation scheme for melanoma specialist analysis that distorts the lesion's main axis size while maintaining the symmetry and pattern of the lesion, and [19] uses vocal tract length normalization to transform spectrograms so as to handle speech datasets with sparse features. Through a multichannel autoencoder, [4] attempts to bridge the synthetic gap between augmented data and actual data.
Recently, adversarial nets [5] have received plenty of attention; they extract features from real images and produce new synthetic imagery comprising certain "selected" features of the model. Unfortunately, the synthetic data were unable to maintain morphological consistency with the human visual sense, although they are similar enough to real samples to trick the discriminator net. In our paper, we examine augmentation methods by applying the assigned transformation directly to the descriptor's input and conceptually discuss the idiosyncrasies of traditional affine transformation versus the color distortion technique.
3. Method
In this section, we first explain the augmentation methods that we use in training the models. We split them into two categories according to the different visual effects on the deformed images.
3.1. Affine Transformation
We assume that the original image matrix is $I$ and the transformation matrix is $T$, so that all affine-transformed images can be generated from $I' = T \cdot I$. To keep the size of the image, we add the term $P$ as padding at the end of the function, which gives $I' = T \cdot I + P$.
Shift operator: panning the image in one or more directions, with points outside the image boundaries filled by the edge pixels or by the constant 0. In mathematics, a shift operator's transformation matrix is a special nilpotent matrix $N$; the nonzero values of $N$ lie only on the superdiagonal or subdiagonal. For example, the 5 × 5 superdiagonal nilpotent matrix is

$$N = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \quad (1)$$

and the original image is represented as the matrix $I = (a_{ij})_{5 \times 5}$ in (2); the matrix that moves the image up one pixel is obtained from $N \cdot I$, and in the same way the right-shift-by-one-pixel matrix is $I \cdot N$, as shown in (3).
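As a small numerical check of this formulation (a sketch of ours, not the paper's implementation), the following NumPy snippet builds a 5 × 5 superdiagonal nilpotent matrix and multiplies it with a toy image matrix:

```python
# Sketch: shifting an image with a superdiagonal nilpotent matrix (illustrative only).
import numpy as np

n = 5
N = np.eye(n, k=1)                                  # 5 x 5 nilpotent matrix, 1s on the superdiagonal
I = np.arange(n * n).reshape(n, n).astype(float)    # toy "image" matrix

up_one = N @ I      # each row becomes the row below it: the image moves up one pixel
right_one = I @ N   # each result column takes the column to its left: the image moves right one pixel
```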
Flip: many researchers enhance their training samples with the flip operator without prior knowledge; the reason is that the flip operation simply mirrors the image in the horizontal or vertical direction and the valid pixel information is kept. In computer arithmetic, given an image matrix $I_{m \times n} = (a_{ij})$, where $i = 1, \dots, m$ indexes rows and $j = 1, \dots, n$ indexes columns, reversing the order of the columns, i.e., reading each row from $a_{in}$ back to $a_{i1}$, flips the image horizontally.
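A minimal sketch of this column-reversal view of the horizontal flip (illustrative, written in NumPy rather than any particular library used by the authors):

```python
# Sketch: horizontal flip expressed as a reversal of column order.
import numpy as np

I = np.arange(12).reshape(3, 4)
flipped = I[:, ::-1]                        # read each row from the last column back to the first
assert np.array_equal(flipped, np.fliplr(I))
```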
Rotation: the rotation operation acts like a random flip at different angles. First, the original input image (Euclidean geometry) is reinterpreted through a Cartesian coordinate conversion. Then, we assume the input sample is rotated counterclockwise about the origin by angle $\theta$. A point $(x, y)$ of the image in this coordinate system is mapped to the rotated point $(x', y')$, where $x' = x\cos\theta - y\sin\theta$ and $y' = x\sin\theta + y\cos\theta$. In our experiments, we take the center pixel of the image as the rotation center. The center point $(x_0, y_0)$ is written as $(x_0, y_0, 1)$ in homogeneous coordinates; the object is first moved to the origin, where a point changes as $(x, y, 1) \mapsto (x - x_0, y - y_0, 1)$, with the translation expressed by the matrix in (4). Performing a clockwise rotation by angle $\theta$ generates the rotation matrix (5), and shifting the center from the origin back to $(x_0, y_0)$ gives the translation matrix (6). Hence, combining (4)–(6), we get the updated transformation as follows:

$$T_1 = \begin{pmatrix} 1 & 0 & -x_0 \\ 0 & 1 & -y_0 \\ 0 & 0 & 1 \end{pmatrix}, \quad (4) \qquad
R = \begin{pmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad (5) \qquad
T_2 = \begin{pmatrix} 1 & 0 & x_0 \\ 0 & 1 & y_0 \\ 0 & 0 & 1 \end{pmatrix}, \quad (6)$$

$$\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = T_2 \, R \, T_1 \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}. \quad (7)$$
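To make the composition in (7) concrete, here is a small NumPy sketch (our reconstruction, not the paper's code) that builds the combined translate–rotate–translate matrix about an image center and applies it to one point; the 30-degree angle and the 224 × 224 center are illustrative assumptions.

```python
# Sketch: rotation about the image center in homogeneous coordinates.
import numpy as np

def center_rotation_matrix(theta, x0, y0):
    """Combined matrix T2 @ R @ T1 for a rotation of `theta` radians about (x0, y0)."""
    T1 = np.array([[1, 0, -x0], [0, 1, -y0], [0, 0, 1]], dtype=float)   # move center to the origin
    R = np.array([[np.cos(theta),  np.sin(theta), 0],
                  [-np.sin(theta), np.cos(theta), 0],
                  [0, 0, 1]], dtype=float)                               # clockwise rotation
    T2 = np.array([[1, 0, x0], [0, 1, y0], [0, 0, 1]], dtype=float)      # move back to the center
    return T2 @ R @ T1

M = center_rotation_matrix(np.deg2rad(30), x0=112, y0=112)   # 30-degree rotation about a 224x224 center
x_new, y_new, _ = M @ np.array([50.0, 80.0, 1.0])            # rotated position of the pixel at (50, 80)
```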
3.2. Colorimetric Transformations
Many people modify their images in everyday life with filters that alter the color or contrast distribution, and shot photos are also easily affected when the ambient lighting changes. Because the younger generation likes to enhance picture structures by applying color-casting filters, the collected dataset may include distorted images, and a robust model should be adjusted to account for the disruption caused by these color manipulations.
Contrast enhancement: in the human visual system, the capacity to distinguish between luminance levels is contrast sensitivity; humans recognize a change more easily when an image is altered in contrast rather than in absolute luminance. There are many existing definitions of contrast; to keep the contrast adjustment close to the human perception system as well as computationally cheap, we take luminance contrast in our experiments. The transformation $I'(x) = F \cdot (I(x) - 128) + 128$ is used to obtain the input training data, where $I$ denotes the original input image, $I'$ is the transformed image, and $F$ is the contrast correction factor, as defined in (8):

$$F = \frac{259\,(C + 255)}{255\,(259 - C)}. \quad (8)$$
In the equation above, the parameter C denotes the desired level of contrast; the correction defined by (8) is applied to each of the R, G, and B channels simultaneously. For convenience, the image contrast was adjusted with PIL (the Python Imaging Library) in our experiments.
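Since the contrast adjustment is performed through PIL, a minimal sketch of the kind of call involved might look as follows; the enhancement factor 0.8 and the file names are illustrative assumptions, not the paper's settings.

```python
# Sketch: contrast perturbation with PIL's ImageEnhance module (illustrative parameters).
from PIL import Image, ImageEnhance

img = Image.open("sample.jpg")                            # hypothetical input file
contrast_img = ImageEnhance.Contrast(img).enhance(0.8)    # a factor of 1.0 keeps the original image
contrast_img.save("sample_contrast.jpg")
```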
Color transformation: people intentionally vary the intensities of colors to show multiple color schemes, generally operating directly on the tricolor (red, green, and blue) channel pixels, which affects the overall mixing of colors in images and requires sophisticated color balance corrections. Normally, images are shot with neutral tones under the right color temperature filters and ambient light. Unnatural color temperatures, on the other hand, can result in undesirable casts. To simulate this form of color casting, random color perturbations of the input pictures were also generated with the PIL library package.
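Similarly, a hedged sketch of random color perturbation with PIL (the perturbation range and file names here are our assumptions, chosen only to illustrate the mechanism):

```python
# Sketch: random color (saturation) perturbation with PIL (illustrative range).
import random
from PIL import Image, ImageEnhance

img = Image.open("sample.jpg")                     # hypothetical input file
factor = random.uniform(0.8, 1.2)                  # small random cast around the original colors
color_img = ImageEnhance.Color(img).enhance(factor)
color_img.save("sample_color.jpg")
```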
4. Experiments
4.1. Datasets
In our experiments, we test three representative image datasets to verify the reliability of the thesis. The number of samples in each category was kept equal to avoid class-imbalance effects; besides, each class is kept relatively small so that we can investigate the model's performance when the number of pictures is limited. 80% of the samples were assigned to training and the other 20% to validation; once the model started training, augmented data were generated continuously until the validation accuracy no longer increased.
ImageNet: we artificially selected a small subset of the ImageNet dataset, consisting of 10 classes, with 500 samples for training and 100 for validation. Before the transformation augmentation was applied, the original image size was preserved in order to evaluate the effectiveness of the augmentation technique and the model's performance when training on highly varied data. An object may lose its original shape and leave only half or part of it visible; for example, some fruit or vegetable (e.g., a pumpkin) may have been cooked into food (pumpkin pie), which looks totally different from the original (Figure 1).

The rest of the data come from the Pascal VOC dataset. This standard picture dataset is often used for object class detection in photographs with complicated background information and multiple objects. It should be noted that VOC often has multiple labels for the same picture; we delete the duplicate-tag images and replace them with fresh samples to ensure that all of the photos belong to a single category. Finally, there are 20 classes in the restructured dataset, each with 10,000 labeled photos.
Last but not least, there is our Pests dataset (Figure 2). Each picture is "pure," meaning there is just one item in each shot, the backdrop is generally a solid color, and the object has a definite border. We designed it based on the human visual system; thus, the relevant item is simple to detect, which ensures that the network converges rapidly. In addition, we normalized the object scale so that objects occupy an "optimistic" average proportion (0.3 to 0.5) [20] of the picture area, to guarantee that object scale would not affect the experimental findings. The pictures are resized to 224 × 224 × 3, with 10 classes and 500 photos allotted to training and the remaining 100 to validation.

4.2. Classification Descriptor
In this paper, we evaluated three deep convolutional classification descriptors. Details of each network architecture are illustrated in Table 1.
AlexNet [9] reached a new developmental milestone for convolutional neural networks; it trains complex deep convolutional models on multiple GPUs and shortens the training time to a reasonable range through highly optimized parallel computing, making it possible to train deep networks on large-scale datasets. The network consists of five convolutional layers, three fully connected layers, and a Softmax layer; with the help of the ReLU nonlinearity and Local Response Normalization (LRN) at the first two convolutional layers, the weight parameters are prevented from saturating and the impact of the vanishing and exploding gradient problems is reduced. AlexNet overcame the issue of overfitting and achieved a top-5 prediction error of 15.3% at the 2012 ImageNet Large-Scale Visual Recognition Challenge.
In the recent literature, the impact of depth has been a hot topic; as a representative deep neural network, VGG [10] reached state-of-the-art performance using only 16 weight layers. The whole network architecture is built from deep stacks of 3 × 3 kernels to reduce the number of weight parameters, which proved effective in terms of the receptive field. The VGG architecture is employed in our study to analyze the impact of depth on our augmentation strategy. Finally, instead of simply increasing the number of network layers, Inception constructs itself from blocks made up of numerous convolutional layers with varying filter sizes. Inception [11], which uses an appropriate factorization to divide a 3 × 3 filter into 3 × 1 and 1 × 3 filters, maintains a better balance of compute and memory while speeding up training and improving the network's nonlinear expressiveness. We evaluate Inception V3 to find out the influence of this special network connection.
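To illustrate the factorization idea mentioned above (our own sketch, not the actual InceptionV3 block definition), a 3 × 3 convolution can be approximated by a 1 × 3 convolution followed by a 3 × 1 convolution; the channel counts and input shape below are assumptions.

```python
# Sketch: replacing a 3x3 convolution with a 1x3 followed by a 3x1 convolution (illustrative).
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(224, 224, 64))
x = layers.Conv2D(64, (1, 3), padding="same", activation="relu")(inputs)
x = layers.Conv2D(64, (3, 1), padding="same", activation="relu")(x)  # same receptive field as 3x3, fewer parameters
model = models.Model(inputs, x)
```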
4.3. Experiment Settings
We use a series of experiments based on the three network architectures and three datasets to support the conclusion that color augmentation is less effective than the usual affine transformation approach. First, the data were separated into training and validation sets at a 4 : 1 ratio and then fed into the various augmentation methods; synthetic images continued to be generated to train the class descriptor until the validation loss no longer decreased, as shown in Figure 3. Table 2 lists the augmentation parameter settings in detail. The decision rule for the color augmentation parameters ensures that the perturbation of an image does not exceed 20% compared with the original. Once an architecture and a dataset are selected, the other hyper-parameters are kept consistent: the training optimizer was set to SGD (stochastic gradient descent) with 0.9 momentum and weight decay; in addition, the initial learning rate was set separately for the Pests and VOC datasets and for ImageNet, which helps avoid huge swings. A dedicated batch size is used when training on the VOC database and a batch size of 8 for the others, to shorten the training time and decrease the gap between training and validation accuracy. Early stopping was employed to monitor the validation loss; when the change in validation loss falls below a minimum threshold, the model trains for five more epochs before being stopped. We report our results as the average of three repeated experiments.
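The following Keras-style sketch shows how such a training configuration could be expressed; the learning rate, min_delta, and epoch count are placeholders of our own (the paper's exact values are not reproduced here), and build_model, x_train, y_train, x_val, and y_val are hypothetical helpers and arrays.

```python
# Sketch of the training setup described above; numeric values are illustrative placeholders,
# and build_model()/x_train/y_train/x_val/y_val are hypothetical, not the paper's code.
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import EarlyStopping

model = build_model()                                   # hypothetical: AlexNet, VGG16, or InceptionV3
optimizer = SGD(learning_rate=1e-3, momentum=0.9)       # SGD with 0.9 momentum, placeholder learning rate
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])

early_stop = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=5)  # stop 5 epochs after improvement stalls
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=200, batch_size=8, callbacks=[early_stop])
```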

5. Result
We first consider the five augmentations, divided into two groups (color and traditional), and evaluate their average training and validation accuracy in Table 3.
The experiments show that there is a major overfitting issue on the ImageNet and VOC datasets, despite the fact that the model automatically stops training when the validation loss no longer decreases. As a consequence of the small data sizes and varied man-made object structures, the neural networks achieved "optimistic" training accuracy with ease but the worst prediction outcomes. With an ineffective augmentation strategy, the network attempted to incorporate extraneous picture characteristics and established incorrect boundaries between categories, expanding the gap between training and test accuracy. The classic transformation methods, on the other hand, could correctly analyze unknown photographs and shorten this distance, reducing the propensity for overfitting. The ImageNet and VOC datasets provide this evidence particularly clearly: the backgrounds and recognized objects of the input images are complicated and mutable, with fuzzy noise and redundant information; nevertheless, traditional affine augmentation distorts the image while preserving the object features, which improves the ability of the model to learn crucial features and discard noisy background information. On the Pests images, all three network architectures did well when using augmented data, but the challenge of overfitting remained with the color augmentation methods. Another interesting note is that the network with fewer layers and a simpler composition tends to show better results in both training and validation; this may be due to the preset hyper-parameters and the absence of fine-tuning when training on the limited labeled data, although it also highlights the importance of selecting the proper augmentation method.
5.1. The Performance on Data Pests
From the initial phase, it was found that deep neural networks are prone to extracting a large amount of residual variation as the foundation for generalizing models, using additional needless parameters in order to fit new data or accurately forecast real-world objects. As a result, we ignore the phenomenon of overfitting in this section and simply compare validation accuracy to explain the performance of each augmentation approach.
Figure 4 shows the progression of this validation. Most notably, although each augmentation technique performs differently, classic affine augmentation outperforms color augmentation, and the rotation manipulator yields a high prediction accuracy of 85.2% compared with the other approaches. Another important element to note is that, in training, the flip operation performs somewhat better than color. For the Inception model, a large gap was observed between the flip operation and the other traditional methods; we infer that this singular result could be caused by the special network block module in Inception V3, where the receptive field of a spatial convolutional filter is broken into a 3 × 1 convolutional kernel followed by a 1 × 3 convolutional filter. These asymmetric convolutions are liable to fit flipping transformations: for example, an image row that looks like $(a_1, a_2, \dots, a_n)$ becomes $(a_n, \dots, a_2, a_1)$ after a horizontal reverse, and with convolutional filters of sizes 1 × 3 and 3 × 1, the network can learn a weight vector $w$ and its reversed counterpart $w'$ so that the response $w * x$ on the original image equals $w' * x'$ on the flipped one.

5.2. The Performance on ImageNet
The results of the diverse augmentation algorithms on ImageNet were analogous to the previous Pests results: the color perturbation methods fit the training samples better while generating many obscure features, and evaluated worst because they inadequately captured features to represent the model structure. On ImageNet, we further analyze the network training curves for the six augmentation settings (the five methods plus the original data). Figure 5 illustrates how the validation accuracy varies as the number of epochs increases. The first observation is that the input samples obtained from contrast, color, or the original data rise rapidly at the beginning of training and almost stop at the same level, around 0.6, after about 35 iterations, when the validation loss no longer improves, while the other, traditional transformations rise smoothly and take more time to fit the elastically deformed objects. To handle these deformations, the model has to learn more scale-invariant features from the distorted objects; this scale-invariant information also holds for predicting samples with high probability. However, it cannot be ignored that the curves produced by the color-augmented images are similar to those of the real images, and they are more inclined to reach saturation in less training time; that is, the number of characteristic parameters learned from such data is smaller than what the observed models require.

5.3. The Performance on VOC
In the experiment with the VOC dataset, we emphasize creating augmented samples to compare their capability to overcome the overfitting problem. Figure 6 displays the overfitting rate of each transformation method when training the AlexNet, VGG16, and InceptionV3 models. The advantage of the conventional affine transformation methods over the color enhancement methods is large and obvious in these results. In all respects, the rotation operation effectively narrows the gap, and its role is the same as that observed on the Pests data. The combination of shift transformation and center rotation transformation increases the difficulty of model fitting. Color augmentation, on the other hand, proved ineffective in reducing the impact of overfitting; when training with contrast-enhanced pictures, the Inception V3 model had a 0.53 overfitting rate, demonstrating its failure to learn crucial characteristics. Another observation is that as the convolutional network depth grew, the model became more difficult to train, causing it to fail to converge when the number of input pictures was insufficient to meet the demand.

5.4. Experimental Analysis
Why does color augmentation fit the training samples better yet make test images harder to predict? Our intuition is that the color and contrast casting methods break the balance of the entire pixel distribution, distorting the photos across all pixel values rather than in parts. To illustrate this phenomenon, we picked one image and drew its histogram at three contrast augmentation parameters. The chart (Figure 7) displays the pixel distribution of the image, where the parameter 1.0 corresponds to the original. When the image contrast is changed to 0.7, the image loses more "color"; the composite image tends toward pure grey, with the histogram more closely resembling a Gaussian distribution, resulting in the disappearance of many of the white pixels necessary to form the outline of the butterfly's wings. When the contrast value goes up, the number of black pixels increases sharply and the intermediate pixel values spread to other pixel values; the curve of the histogram changes smoothly, which appears as a fuzzier object edge in the figure as well. The obscured image loses many scale-invariant features, so the network fits it well but performs worse on real-world objects.
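A hedged sketch of how such a histogram comparison can be produced (our own illustration; the file name and contrast factors are assumptions, not the paper's exact settings):

```python
# Sketch: pixel histograms of one image under different contrast factors (illustrative).
import matplotlib.pyplot as plt
from PIL import Image, ImageEnhance

img = Image.open("butterfly.jpg").convert("L")      # hypothetical image, converted to grayscale
for factor in (0.7, 1.0, 1.3):                      # 1.0 is the original; the others are example perturbations
    adjusted = ImageEnhance.Contrast(img).enhance(factor)
    plt.hist(list(adjusted.getdata()), bins=256, alpha=0.5, label=f"contrast {factor}")
plt.xlabel("pixel value")
plt.ylabel("count")
plt.legend()
plt.show()
```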

On the other hand, although traditional affine transformations distort the shape of the object in the image, fewer valid pixels are lost than with color transformation; they preserve most of the scale-invariant features while moving the original pixels from their original locations to new positions. This forces the network model to readjust its weight values; when fitting the synthetic images transformed under different augmentation parameters, a virtual layer is effectively built to correct for the affine transformation. This point might explain why the rotation operation was effective in reducing overfitting: to obtain the desired image orientation, each pixel's position is replaced after multiplication by sines and cosines in homogeneous coordinates.
6. Conclusion
This paper demonstrates the performance of five data augmentation methods in improving the accuracy of image classification. A series of experimental results strongly supports our thesis that applying a proper affine transformation to generate synthetic data is more robust than changing the color or contrast of images, boosting training and test accuracy as well as reducing overfitting. We surmise that the poor outcome of color augmentation could be due to a break in the density of the feature space distribution; studying this with a wider range of color manipulation parameters is an interesting direction for future research. It is also necessary to test our conjecture on data from various sources and on diverse network architectures. Yet another research issue is why rotation transformation is better than the other affine transformations at reducing the overfitting phenomenon when a proper rotation angle is selected. Finally, it is also interesting to test other affine transformations such as zoom, shear, and/or combinations of transformations to obtain a general standard for selecting different transformation methods.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was supported by the grant from the 2021 Cangzhou Science and Technology Plan Program (Grant No. 213102007) and 2022 Scientific Research Projects of Colleges and Universities in Hebei Province (Grant No. QN2022200).