

ORIGINAL ARTICLE 

Year: 2018, Volume: 4, Issue: 1, Page: 16-21

Prediction for pathological image with convolutional neural network
Wenshe Yin^{1}, Yangsheng Hu^{1}, Qingqing Dong^{1}, Sanli Yi^{1}, Jun Zhang^{2}, Jianfeng He^{1}
^{1} Institute of Biomedical Engineering, Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
^{2} Faculty of Education, Yuxi Normal University, Yuxi, Yunnan Province, China
Date of Web Publication: 18-May-2018
Correspondence Address: Jianfeng He, Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan Province, China
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/digm.digm_46_17
Background and Objectives: Cancer diagnosis is a major concern, and predicting the degree of cell carcinoma is of great importance for treatment. Materials and Methods: First, we obtain a series of tumor cell pathology slices from clinical data and build training and test sets from them by data augmentation. Then, we design a convolutional neural network for training and prediction. After that, we tune the parameters of the training and prediction models empirically. Results: In the experiment, the model predicts cell carcinoma with an accuracy of 87.38%. Conclusions: This study provides a reference for predicting the extent of cell carcinoma progression with a deep learning model.
Keywords: Cell carcinoma, convolutional neural network, deep learning, pathological image
How to cite this article: Yin W, Hu Y, Dong Q, Yi S, Zhang J, He J. Prediction for pathological image with convolutional neural network. Digit Med 2018;4:16-21
Introduction   
Cells are the most basic units of the human body, and their regular cycle of growth and decay is vital to it. Pathologists examine the differentiation of cells to determine whether they have become cancerous. Because of the spatial and temporal heterogeneity of genes and the uncertainty in texture and shape, even experienced pathologists still have diagnostic error rates between 30% and 40%. An efficient and highly accurate method for predicting the degree of deterioration of cell carcinoma is therefore particularly important.
Predicting the degree of deterioration of cancer has received much attention. Especially with the development of feature extraction, classifiers, machine learning algorithms, and deep learning, many researchers have begun to study this field. Examples include the work of Araújo et al. on the classification of breast cancer histology images using convolutional neural networks ^{[1]} and the work of Sun et al. on an enhanced deep convolutional neural network scheme for breast cancer diagnosis with unlabeled data.^{[2]} Building on these related studies, this paper proposes a deep-learning-based method for predicting the degree of deterioration of cell carcinoma.
Materials and Methods   
Deep-learning framework
Deep learning builds neural networks that simulate the way the human brain learns in order to interpret data. It can discover distributed feature representations by combining low-level features into abstract high-level features. The convolutional neural network is one such brain-inspired network. It offers both scale invariance and weight sharing.^{[3]} Feature extraction, feature selection, weight optimization, and model construction are all integrated in the hidden layers of a convolutional neural network.^{[4]} This paper uses the open-source Convolutional Architecture for Fast Feature Embedding (Caffe) deep-learning framework,^{[5]} which was proposed by Yangqing Jia. It is an end-to-end machine learning framework that can be applied to vision, speech, astronomy, and other fields, and it is used by many open-source projects on GitHub.
Designing the structure of convolutional neural networks
The convolutional neural network developed mainly through two stages: the receptive field and the neocognitron.^{[6]} It resembles a biological neural network more closely, and its weight sharing and direct multidimensional image input avoid the separate feature extraction, feature selection, and data reconstruction steps required by supervised learning models such as the support vector machine.^{[7]} Its scale invariance also laid a technical foundation for the development of convolutional neural networks.
At the input layer of the training network, that is, the data layer, the batch size is set to 64 because training in this paper runs on a graphics processing unit (GPU); the value can be adjusted according to the processing capacity of the computer. For the input data, Caffe supports two file types, Lightning Memory-Mapped Database (LMDB) and LEVELDB, both of which are embedded key/value database libraries. Although LMDB consumes about 1.1 times as much memory as LEVELDB, it is about 15% faster, so this paper chooses LMDB. To map the pixel values of each image channel into the range [0, 1), we set the scale to 0.00390625 (that is, 1/256).
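The effect of the scale value can be checked with a few lines of Python. This is only an illustrative sketch: the function name `scale_pixels` and the sample intensities are hypothetical, and the input is assumed to be 8-bit pixel data (0 to 255).

```python
# A data-layer scale multiplies every raw pixel value by a constant.
SCALE = 0.00390625  # equals 1 / 256, the value used in the text


def scale_pixels(pixels, scale=SCALE):
    """Multiply each 8-bit pixel value by `scale`."""
    return [p * scale for p in pixels]


raw = [0, 64, 128, 255]        # hypothetical 8-bit intensities
scaled = scale_pixels(raw)
print(scaled)                  # every value lies in [0, 1)
```

Because the largest 8-bit value is 255, the scaled maximum is 255/256, slightly below 1.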
At the convolution layer, we set the learning-rate multiplier of the filter weights (lr_mult) to 1 and that of the bias to 2; the bias multiplier is usually twice the weight multiplier, which noticeably speeds up convergence. The number, size, stride, and padding of the convolution kernels strongly influence feature extraction and the final prediction on the validation set. Based on prior knowledge and experience, we use 32 convolution kernels with a stride of 1 and a kernel size of 1. A Gaussian distribution, specified by its mean and standard deviation, generates initial kernel weights that are regular and sparse, which benefits the generation of feature maps. We therefore initialize the convolution kernels randomly from a Gaussian distribution with a mean of 0 and a standard deviation of 0.001.
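As a minimal sketch of the Gaussian initialization described above, the following Python code draws weights for 32 kernels of size 1 × 1 with mean 0 and standard deviation 0.001. The helper name `gaussian_filler`, the fixed seed, and the choice of 3 input channels are assumptions made for illustration; the text does not state the number of input channels.

```python
import random


def gaussian_filler(num_kernels, channels, kernel_size,
                    mean=0.0, std=0.001, seed=0):
    """Return weights[kernel][channel][row][col] drawn from N(mean, std^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [[[[rng.gauss(mean, std) for _ in range(kernel_size)]
              for _ in range(kernel_size)]
             for _ in range(channels)]
            for _ in range(num_kernels)]


# 32 kernels of size 1 x 1; the 3 input channels are an assumption.
weights = gaussian_filler(num_kernels=32, channels=3, kernel_size=1)
flat = [w for kernel in weights for ch in kernel for row in ch for w in row]
sample_mean = sum(flat) / len(flat)
print(len(flat), sample_mean)
```

With such a small standard deviation, all initial weights stay close to zero, so the first feature maps start from nearly uniform responses.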
At the pooling layer, the texture features of cells are important for predicting the extent of deterioration, and max pooling preserves texture features better. We therefore use max pooling, with the pooling window empirically set to 3 × 3 and the stride set to 2.
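The following sketch illustrates 3 × 3 max pooling with a stride of 2 on a small two-dimensional input. It assumes no padding and rounds the output size up, as Caffe does for pooling layers; the function names and the 5 × 5 sample image are hypothetical.

```python
import math


def pool_output_size(size, kernel=3, stride=2):
    """Output length of a pooling layer without padding (rounded up)."""
    return math.ceil((size - kernel) / stride) + 1


def max_pool_2d(image, kernel=3, stride=2):
    """Max-pool a 2-D list of numbers; border windows are clipped."""
    height, width = len(image), len(image[0])
    out_h = pool_output_size(height, kernel, stride)
    out_w = pool_output_size(width, kernel, stride)
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            r0, c0 = i * stride, j * stride
            window = [image[r][c]
                      for r in range(r0, min(r0 + kernel, height))
                      for c in range(c0, min(c0 + kernel, width))]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 5 x 5 input whose value at (r, c) is r * 5 + c.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
pooled = max_pool_2d(image)
print(pooled)
```

Each output value is the strongest response inside its window, which is why max pooling tends to keep sharp texture detail that averaging would blur.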
At the activation layer, this paper uses the rectified linear unit (ReLU) with the activation function max(x, 0): when x > 0 the output is x, and when x ≤ 0 the output is 0. Its output is sparse, and it greatly alleviates the vanishing-gradient problem.
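A short Python sketch of the activation max(x, 0) and its gradient is given here for illustration; the function names and sample inputs are hypothetical.

```python
def relu(x):
    """Rectified linear unit: x for positive inputs, 0 otherwise."""
    return max(x, 0.0)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 0.5, 3.5]          # hypothetical pre-activations
outputs = [relu(x) for x in inputs]
zero_fraction = outputs.count(0.0) / len(outputs)
print(outputs, zero_fraction)
```

All non-positive inputs map to exactly zero, which produces the sparse output, while positive inputs pass through with a gradient of 1.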
The fully connected layers take the same form as the convolution layers, except that the standard deviation of the Gaussian distribution is 0.1 and the number of output classes is 5, which equals the number of severity levels. The kernel function also has some influence on accuracy.^{[8]} In this paper, the sigmoid function is used in the first fully connected layer, and a radial basis function is used in the second fully connected layer, that is, the output layer. They are defined as follows:
Sigmoid function: $f(x) = \frac{1}{1 + e^{-x}}$
Radial basis function:
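As a rough illustration of the two functions named above, the following sketch assumes the logistic sigmoid and a Gaussian radial basis function. The Gaussian form and the names `center` and `sigma` are assumptions, since the text does not fix the exact radial basis function used.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gaussian_rbf(x, center, sigma=1.0):
    """Gaussian radial basis function (assumed form): the response
    depends only on the distance between x and the center."""
    squared_distance = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


print(sigmoid(0.0))                            # midpoint of the sigmoid
print(gaussian_rbf([1.0, 2.0], [1.0, 2.0]))    # input equal to the center
print(gaussian_rbf([0.0, 0.0], [3.0, 4.0]))    # input far from the center
```

The radial basis response is largest when the input coincides with the center and decays as the distance grows.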
The network structure, parameters, and functions above can be adjusted as needed. For example, when training the prediction model on a central processing unit (CPU), the batch size can be reduced to avoid memory overflow. If the computer does not have enough memory, the training and test sets can be converted to LEVELDB files as input to reduce memory consumption. To further improve convergence speed and accuracy, a mean file can be applied to the input data. The weight learning rate, bias learning rate, number of convolution kernels, kernel size, stride, padding, and pooling window size and stride can all be adjusted according to the observed loss and accuracy. For the initialization of the convolution kernels, constant, uniform, Xavier, or bilinear initialization can be chosen instead.
For the pooling layer, average pooling or stochastic pooling can also be chosen: average pooling preserves the overall features, and stochastic pooling avoids excessive distortion of the feature map. Different pooling methods can also be used in different layers to keep feature extraction complete. For the activation layer, the sigmoid function can replace ReLU, but the sigmoid saturates easily, which slows the convergence of the loss function. The training network structure is shown in [Figure 1] and the structure of the prediction model is shown in [Figure 2].
Learning rate strategy
As data sets grow and network structures become deeper, training deep models often takes a long time. Choosing or designing a learning strategy is therefore an important factor in improving convergence speed and reducing training time.^{[9]} The solver offers a variety of optimization methods,^{[10]} which update the parameters through forward inference and backward gradient computation and thereby reduce the loss. The loss is defined as follows:
$$L(W) = \frac{1}{|D|}\sum_{i=1}^{|D|} f_{W}\left(X^{(i)}\right) + \lambda r(W) \approx \frac{1}{N}\sum_{i=1}^{N} f_{W}\left(X^{(i)}\right) + \lambda r(W)$$
where D is the given data set, N is the size of a random subset of the data with N much smaller than |D|, L(W) is the average loss, W denotes the weights being updated, f_{W}(X^{(i)}) is the loss on the i-th data item X^{(i)}, r(W) is the regularization term, and λ is its weight. The model computes f_{W} in the forward pass, that is, the loss, and ∇f_{W} in the backward pass, that is, the gradient. The parameter update ΔW is then computed from the gradient ∇f_{W}, the gradient of the regularization term ∇r(W), and other terms specific to each method.
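The idea of estimating the average loss from a random subset can be sketched in Python as follows. The one-weight linear model, the squared-error loss, the L2 penalty, and the toy data are all assumptions chosen only to make the sketch runnable; they are not the model of this paper.

```python
import random


def squared_error(weights, sample):
    """Per-sample loss f_W(X): squared error of a one-weight linear model."""
    x, y = sample
    return (weights[0] * x - y) ** 2


def l2_penalty(weights):
    """Regularization term r(W): half the squared L2 norm."""
    return 0.5 * sum(w * w for w in weights)


def full_loss(dataset, weights, lam):
    """Average loss over the whole data set D plus the weighted penalty."""
    data_term = sum(squared_error(weights, s) for s in dataset) / len(dataset)
    return data_term + lam * l2_penalty(weights)


def minibatch_loss(dataset, weights, batch_size, lam, rng):
    """Estimate the loss from N = batch_size randomly drawn samples."""
    batch = rng.sample(dataset, batch_size)
    data_term = sum(squared_error(weights, s) for s in batch) / batch_size
    return data_term + lam * l2_penalty(weights)


rng = random.Random(0)
dataset = [(float(x), 2.0 * x) for x in range(100)]   # toy data, y = 2x
weights = [1.5]
estimate = minibatch_loss(dataset, weights, batch_size=10, lam=0.01, rng=rng)
exact = full_loss(dataset, weights, lam=0.01)
print(estimate, exact)
```

The mini-batch value is a noisy estimate of the full loss; it matches the full loss when the batch covers the whole data set.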
Currently, there are many learning strategies, such as stochastic gradient descent (SGD),^{[11]} adaptive moment estimation (Adam),^{[12]} AdaDelta,^{[13]} and the combinatorial AdaMix method proposed by Yuyao and Baoqi.^{[9]} We train a handwritten digit recognition model on the MNIST data set ^{[14]} to observe the iterative behavior of the first three.
With the Adam learning strategy, the recognition accuracy exceeds 90%, but the iterations oscillate strongly, which hinders a stable rise in accuracy. With AdaDelta, the recognition accuracy also exceeds 90%, but convergence is too slow, and the accuracy is difficult to improve toward the end of the iterations. With SGD, the recognition accuracy is almost 100% and convergence is very fast: 98% accuracy is reached within about the first 1000 iterations.
Although SGD is not an adaptive optimization method like Adam or Adaptive Gradient (AdaGrad), which adjust the learning rate according to the characteristics of the function itself, it is a simple and effective optimization method in many cases.^{[9]} In this paper, the SGD method is defined as follows:
$$V_{t+1} = \mu V_{t} - \alpha \nabla L(W_{t})$$
$$W_{t+1} = W_{t} + V_{t+1}$$
where V_{t+1} is the current update value, W_{t+1} is the updated weight, V_{t} is the previous update value, W_{t} is the current weight, the learning parameters α and μ are the weights of the negative gradient and of the previous update, respectively, and −∇L(W_{t}) is the negative gradient.
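A minimal sketch of this update rule, in which the new update value combines the previous update (weighted by μ) with the negative gradient (weighted by α), is given here. The quadratic test function, the starting point, and the values α = 0.1 and μ = 0.9 are assumptions for illustration only.

```python
def sgd_momentum_step(weight, velocity, grad, lr, momentum):
    """One SGD update: V_new = momentum * V - lr * grad, W_new = W + V_new."""
    new_velocity = momentum * velocity - lr * grad
    new_weight = weight + new_velocity
    return new_weight, new_velocity


def grad_quadratic(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


# One step from w = 0 with zero initial update value.
first_weight, first_velocity = sgd_momentum_step(
    0.0, 0.0, grad_quadratic(0.0), lr=0.1, momentum=0.9)

# Repeating the step drives w toward the minimizer w = 3.
w, v = 0.0, 0.0
for _ in range(500):
    w, v = sgd_momentum_step(w, v, grad_quadratic(w), lr=0.1, momentum=0.9)
print(first_weight, w)
```

The previous update acts as momentum: it keeps the weight moving in a consistent direction even when individual gradients are small or noisy.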
Data selection
In this paper, the 165 images of the cellular pathology image recognition data set supplied by Warwick ^{[15]} (https://www2.warwick.ac.uk/fac/sci/dcs/research/combi/research/bic/glascontest) are used as the training and test sets. Through data augmentation (DA), the training set is extended to 4950 images and the test set to 1573 images. DA is achieved by rotating, deforming, twisting, cropping, and adding noise.^{[16]} According to the deterioration information provided by Warwick, the images are classified into five categories: benign, benign adenoma, malignant moderate differentiation, malignant moderate to poor, and malignant. Before augmentation, the training set consists of 21 benign, 16 benign adenoma, 24 malignant moderate differentiation, 12 malignant moderate to poor, and 12 malignant cases. The test set consists of 21 benign, 16 benign adenoma, 23 malignant moderate differentiation, 8 malignant moderate to poor, and 12 malignant cases.
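Three of the augmentation operations named above (rotating, cropping, and adding noise) can be sketched in plain Python on a grayscale image stored as a list of rows. This is not the augmentation pipeline used in the paper; the function names, the noise level, and the tiny sample image are hypothetical.

```python
import random


def rotate90(image):
    """Rotate a 2-D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def crop(image, top, left, height, width):
    """Cut out a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def add_noise(image, std, rng):
    """Add Gaussian noise and clamp the result to the 8-bit range."""
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, std))) for p in row]
            for row in image]


def augment(image, rng, noise_std=5.0):
    """Return the four 90-degree rotations, each with and without noise."""
    variants = []
    rotated = image
    for _ in range(4):
        variants.append(rotated)
        variants.append(add_noise(rotated, noise_std, rng))
        rotated = rotate90(rotated)
    return variants


rng = random.Random(0)
image = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]   # hypothetical image
variants = augment(image, rng)
patch = crop(image, 0, 1, 2, 2)
print(len(variants), patch)
```

Each source image yields several label-preserving variants, which is how a small original set can be expanded into thousands of training samples.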
Results   
Based on the methods described above, this paper constructs a deep-learning prediction model for the deterioration of cell carcinoma and designs a deep-learning training network and a prediction network. Compared with the training network, the prediction network has no training or test data layers and no loss or accuracy layers, but it adds an input layer for the test image and a Softmax layer that outputs the class likelihoods. SGD is selected as the learning strategy, and the model is trained for 170,000 iterations. The final training loss is 0.0168896, the test loss is 0.0356889, and the prediction accuracy is 87.38%. Five images with different degrees of deterioration were then predicted with the trained model. The prediction accuracies for benign, benign adenoma, malignant moderate differentiation, malignant moderate to poor, and malignant are 87.21%, 94.17%, 89.33%, 81.58%, and 90.91%, respectively. Detailed data are given in [Table 1]. The loss curve and the accuracy curve for training the prediction model are shown in [Figure 3].
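The role of the Softmax layer in the prediction network, turning five class scores into likelihoods, can be sketched as follows. The score values are hypothetical and are not outputs of the trained model.

```python
import math

CLASSES = [
    "benign",
    "benign adenoma",
    "malignant moderate differentiation",
    "malignant moderate to poor",
    "malignant",
]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    largest = max(scores)                       # subtract max for stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 0.5, -1.0, 0.1, 1.2]             # hypothetical network outputs
probs = softmax(scores)
predicted = CLASSES[probs.index(max(probs))]
print(predicted, probs)
```

The class with the highest likelihood is taken as the predicted degree of deterioration.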
As [Figure 3] shows, both the loss and the accuracy evolve well: the final loss is below 0.1 and the final accuracy is above 0.85, which is a satisfactory result.
For example, when the benign slice (a) and the malignant slice (b) in [Figure 4] are fed to the prediction model as input, the prediction accuracy is 92.53% for the benign slice (a) and 87.91% for the malignant slice (b).
Discussion   
On the one hand, data augmentation contributes greatly to the results of this experiment. The data set provided by Warwick contains images with 5 different labels, divided into training and test sets as machine learning training requires. However, the original data provided by Warwick are few. If the model is trained on the original data alone, a machine learning algorithm such as a convolutional neural network cannot extract enough classification features, which leads to underfitting or overfitting. With augmented data, the model produced by convolutional neural network training has good nonlinear fitting ability, and its generalization ability is extended. Before data augmentation, the test accuracy is about 20% and fluctuates around the random-guess value for five classes. The main reason is a severe lack of data. The experimental results show that data augmentation is a good solution to this problem, raising the test accuracy to 87.38%.
On the other hand, the feature extraction layers of the convolutional neural network are also important for the experiment. A good convolutional neural network should satisfy three conditions: powerful feature extraction ability, few parameters, and fast convergence.
For powerful feature extraction, high-level features are usually extracted by convolution and sampling. The number of extracted feature channels is determined by the number of filters in the convolution layer, and the width and stride of the filter determine the size and quality of the extracted feature map. During training, we therefore suggest continually adjusting the parameters of the convolution and sampling layers to maintain the feature extraction ability of the network; the adjustment method has been given in this paper. The number of connection parameters and training parameters can be reduced through parameter sharing and sparse connections, and the convolution and max pooling used in this paper embody these two ideas, which greatly reduces the parameters. Faster convergence has been explained in the previous part: although SGD is not an adaptive optimization method, its straightforward optimization still converges well in many cases. Compared with other optimization methods, it requires fewer hyperparameters and converges more quickly.
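The saving from parameter sharing can be made concrete with a small count. The sketch compares a convolution layer with 32 kernels of size 1 × 1 against a fully connected layer that maps the same input to an output of the same size. The 64 × 64 input with 3 channels is a hypothetical example, since the text does not state the input dimensions.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a convolution layer with shared kernels."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + out_channels


def dense_params(in_units, out_units):
    """Weights plus biases of a fully connected layer."""
    return in_units * out_units + out_units


# Hypothetical input: 64 x 64 pixels with 3 channels.
height, width, in_channels, out_channels = 64, 64, 3, 32

shared = conv_params(in_channels, out_channels, kernel_size=1)
unshared = dense_params(height * width * in_channels,
                        height * width * out_channels)
print(shared, unshared, unshared // shared)
```

The convolution layer reuses the same few weights at every image position, so its parameter count does not grow with the image size, whereas the fully connected count does.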
In terms of training iterations, Caffe defines one iteration as the processing of one batch, which differs from Keras or TensorFlow. In the experiments of this paper, the test accuracy is about 70% after 100,000 training iterations. Even though this accuracy is not high enough, as long as training has not converged and there is no overfitting, the number of iterations can be increased to continue optimizing the objective function and thereby raise the test accuracy. When the test accuracy no longer rises, or rises only slightly, training can be stopped because the objective function has reached its optimum. In this paper, we set the number of iterations to 170,000 on the basis of many tests.
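Because one iteration processes one batch, the iteration count can be converted into passes over the training set. The sketch below uses the figures stated in this paper (170,000 iterations, batch size 64, 4950 training images); the function names are hypothetical.

```python
import math


def iterations_to_epochs(iterations, batch_size, train_size):
    """Number of passes over the training set for a given iteration count."""
    return iterations * batch_size / train_size


def iterations_per_epoch(batch_size, train_size):
    """Batches needed to see every training image at least once."""
    return math.ceil(train_size / batch_size)


epochs = iterations_to_epochs(170_000, batch_size=64, train_size=4950)
per_epoch = iterations_per_epoch(batch_size=64, train_size=4950)
print(round(epochs, 2), per_epoch)
```

Under these figures, 170,000 iterations correspond to roughly 2198 passes over the augmented training set.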
The pathological sections with different degrees of deterioration were stained with hematoxylin and eosin, so their morphological differences are obvious. This matters not only to pathologists but also to the training and prediction of the model, because these differences are the features that the neural network needs to learn. When predicting pathological sections of five different degrees of deterioration, the trained model shows good generalization ability; the prediction accuracy is good enough to assist pathologists and physicians in diagnosis, reduce their workload, and improve diagnostic accuracy.
In this paper, a convolutional neural network is used to predict the degree of deterioration of pathological sections, which demonstrates the feasibility and merits of the method. The purpose is to extend deep learning technology to the medical field, to provide ideas and directions for other researchers, and to provide a prediction model that can be applied to auxiliary diagnosis.
Conclusions   
Because the cell pathology image recognition data set contains little irrelevant information, a deep learning method based on a convolutional neural network is used to predict the degree of deterioration of cell carcinoma. A deep training network is constructed, and the prediction model of the deterioration degree is trained with it. The results show that the prediction accuracy of the deep-learning-based method is 87.38%. The model can be used to predict the degree of deterioration of cell carcinoma and can assist pathologists in related studies. The method has the advantages of deep modeling and high prediction accuracy, and it is easy for pathologists to use in clinical studies.
Acknowledgment
This work is supported by the NSFC under Grant No. 11265007 and in part by CSC.
Financial support and sponsorship
This work is supported by the NSFC under Grant No. 11265007 and in part by CSC. The paper's data sets are supported by Warwick.
Conflicts of interest
There are no conflicts of interest.
References   
1.  Araújo T, Aresta G, Castro E, Rouco J, Aguiar P, Eloy C, et al. Classification of breast cancer histology images using Convolutional Neural Networks. PLoS One 2017;12:e0177544. 
2.  Sun W, Tseng TB, Zhang J, Qian W. Enhancing deep convolutional neural network scheme for breast cancer diagnosis with unlabeled data. Comput Med Imaging Graph 2017;57:4-9.
3.  Vincent P, Larochelle H, Bengio Y, Manzagol PA. Extracting and Composing Robust Features with Denoising Autoencoders. Proceedings of the 25^{th} International Conference on Machine Learning. New York: ACM Press; 2008. p. 1096-103.
4.  Jifeng D, Kaiming H, Jian S. Convolutional Feature Masking for Joint Object and Stuff Segmentation. Computer Vision and Pattern Recognition (CVPR). Boston: IEEE; 2015.
5.  Yangqing J, Shelhamer E, Donahue J, Sergey K, Jonathan L, Ross G, et al. Caffe: Convolutional Architecture for Fast Feature Embedding. Proceedings of the 22^{nd} ACM International Conference on Multimedia. Orlando: ACM; 2014. p. 675-8.
6.  Alberico A. Analysis of the process of visual pattern recognition by the neocognitron: Kunihiko Fukushima. Commun Partial Differ Equ 2016;1 Suppl 1:22.
7.  Yiming H, Di W, Zhifen Z, Huabin C, Shanben C. EMD-based pulsed TIG welding process porosity defect detection and defect diagnosis using GA-SVM. J Mater Process Technol 2017;239:92-102.
8.  Kai Y, Wei X, Yihong G. Deep Learning with Kernel Regularization for Visual Recognition. Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems. Vancouver: NIPS Proceedings; 2008. p. 1889-96.
9.  Yuyao H, Baoqi L. A combinatorial deep learning model for learning rate strategies. Acta Automatica Sin 2016;42:953-8.
10.  Yi Y, Bin W. Deep Learning: Caffe Classic Model, Detailed and Practical. Beijing: Publishing House of Electronics Industry; 2016.
11.  Bordes A, Bottou L, Gallinari P. SGD-QN: Careful quasi-Newton stochastic gradient descent. J Mach Learn Res 2009;10:1737-54.
12.  Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. Computer Science; 2014.
13.  Zeiler MD. Adadelta: An Adaptive Learning Rate Method. Computer Science; 2012.
14.  Li D. The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the web]. Vol. 29. IEEE Signal Processing Magazine; 2012. p. 141-2.
15.  Sirinukunwattana K, Snead DR, Rajpoot NM. A Stochastic Polygons Model for Glandular Structures in Colon Histology Images. IEEE Transactions on Medical Imaging; 2015.
16.  Simard PY, Steinkraus D, Platt JC. Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis. Proceedings of the Seventh International Conference on Document Analysis and Recognition. Washington: IEEE Computer Society; 2003. p. 958.
[Figure 1], [Figure 2], [Figure 3], [Figure 4]
[Table 1]
