A common challenge faced during learning-based image classification is the need for a large number of annotated training examples. This is a considerable drawback of such classification methods, as labeling data is costly in terms of human effort.
The authors of this paper propose a method that tackles this issue by incorporating active learning within the context of deep learning for image classification.
In this section we briefly discuss the two apparently conflicting ingredients that are unified by the authors for their proposed approach: deep learning and active learning.
A. Deep convolutional neural nets for image classification
Deep learning has proven to be a robust and successful method that performs well on classification-related tasks for images. Specifically, convolutional neural networks (CNNs) have been shown to provide good results as they learn features directly from raw pixels: the network learns how to extract features and eventually infers what object the pixels constitute.
CNNs, while successful, have the disadvantage that they require a large number of annotated images. This poses a challenge as annotation of images is expensive not only in terms of time and human effort, due to the expertise needed to do so, but also in terms of the impact a poor annotation can have in sensitive scenarios. An example of this is the annotation of images associated with malignant diseases in patients. Different medical experts may have conflicting opinions in a situation where consensus needs to be reached.
B. Active learning
It is clear that nowadays the volume of data individuals have access to is constantly growing. Making use of it for predictive tasks through modelling, although attractive, poses a great difficulty as the majority of data one can access is not labeled.
The previous difficulty helps make a case for active learning (AL). One idea behind AL is to improve existing models by selecting and annotating the unlabeled data samples that appear to be most informative. The most informative samples are those on which the model is most uncertain, and they are found through three different approaches; such approaches are discussed in Sections III A and III B.
The procedure consists of iteratively increasing the size of an initially selected labeled sample. The intention is to achieve performance at least similar, if not better, to that reached when using a fully labeled dataset. Moreover, the cost of performing this iterative process is in principle lower than annotating all of the unlabeled data present in the dataset.
It is important to point out that AL inherently assumes that the feature representation is fixed, which is what allows the possibility of extracting the most informative sample from the original dataset.
In some cases it has been observed that AL methods work by having a model initialized with a small set of labeled data. The model is then boosted by selecting the most informative samples and sending them to an oracle for annotation. This idea, although promising, faces the problem that the model used for predictions is trained on small datasets and makes use of hand-crafted features.
C. Difficulties in incorporating convolutional neural networks into active learning
The discussion above would suggest that deep learning can benefit from active learning even in the presence of limited labeled data. This is, however, not straightforward, as the AL and DL processes are conflicting: as opposed to CNNs, AL extracts the most informative data points relying on the assumption that the feature representation is static.
This conflict is what motivates the authors to present the approach discussed: cost-effective active learning (CEAL).
III. Cost-effective active learning (CEAL)
In this section we discuss the key idea of CEAL that allows it to avoid the conflict between AL and CNNs mentioned in section II. C. We also discuss previous approaches CEAL improves upon as well as details on how CEAL works and its performance on two datasets.
A. Active learning methods
The main goal of active learning is to build a learning algorithm that is capable of querying an oracle for labels, choosing the examples from which it can learn the most.
As a result of this, the mechanism which selects the most useful data points is crucial for AL to be successful. There are a few mechanisms that have been explored for this purpose:
- Least confidence. The confidence values are simply the highest class probability score for each example. These values are then ranked, and the lowest valued entries are the ‘least confident’ ones.
- Margin sampling. The smaller the margin sampling value is, the more uncertain an example is. It is computed as the difference of probabilities of an example belonging to either the first or second most probable classes predicted by the classifier.
- Entropy. The larger the entropy is the more uncertain an example is. Entropy is computed as the usual information entropy.
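As a rough illustration of these three uncertainty measures, the sketch below computes them with NumPy from a matrix of softmax class probabilities; the helper name and toy inputs are ours, not from the paper.

```python
import numpy as np

def uncertainty_scores(probs):
    """Compute the three AL uncertainty measures from an
    (n_samples, n_classes) matrix of class probabilities."""
    # Least confidence: the highest class probability per example;
    # a lower value means a more uncertain example.
    least_confidence = probs.max(axis=1)
    # Margin sampling: gap between the two most probable classes;
    # a smaller margin means a more uncertain example.
    sorted_probs = np.sort(probs, axis=1)
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]
    # Entropy: the larger the entropy, the more uncertain the example.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return least_confidence, margin, entropy

probs = np.array([[0.9, 0.05, 0.05],   # confident example
                  [0.4, 0.35, 0.25]])  # uncertain example
lc, ms, en = uncertainty_scores(probs)
```

All three rankings agree here: the second example has lower confidence, a smaller margin and higher entropy than the first.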
B. The CEAL algorithm
In order to determine the weights of the CNN classifier, CEAL relies on an AL approach that automatically assigns pseudo-labels (labels created with no human intervention) to the unlabeled examples in the dataset. The implementation of this specific AL approach, in combination with a deep CNN, is the key contribution of the authors and what gives rise to CEAL.
From now on we denote the dataset of interest as $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ represents the $i$-th sample of the dataset and $y_i$ its corresponding label. Each label can take one of the values in the set $\{1, 2, \dots, m\}$. This means we have a dataset with $n$ examples and $m$ categories.
The dataset of interest is the union of the set of labeled examples, $D^L$, and the set of unlabeled examples, $D^U$. That is, $D = D^L \cup D^U$. Let us keep in mind that the cardinalities of $D^L$ and $D^U$ change as the CEAL iterations progress. Initially $D^L$ is the empty set and its cardinality increases gradually as examples from $D^U$ are moved into $D^L$.
In what follows we explain the steps invoked to learn the optimal value of weights for this CNN.
The weights of the deep CNN are initialized by training on a few labeled examples. These labeled examples are obtained by manually annotating examples randomly drawn from $D^U$.
In addition, one needs to choose an integer $T$ representing the number of iterations, an integer $K$ representing the number of informative samples selected per iteration, a real number $\delta_0$ representing a threshold used to identify high-confidence samples, a threshold decay rate $dr$, and a learning rate based on which the weights of the CNN will be updated. The details and relevance of these parameters will become clear as we present the steps of the algorithm.
Examples from $D^U$ are fed into the CNN to generate two groups of samples: informative samples and high-confidence samples.
Informative samples are chosen based on one of the following criteria:
Least confidence. The $K$ examples with the lowest confidence value are chosen; the smaller the confidence value, the more uncertain the example. The confidence value is computed as $lc_i = \max_j p(y_i = j \mid x_i; W)$.
Margin sampling. The $K$ examples with the lowest margin-sampling value are chosen; the smaller the margin-sampling value, the more uncertain the example. This quantity is computed as $ms_i = p(y_i = j_1 \mid x_i; W) - p(y_i = j_2 \mid x_i; W)$, where $j_1$ and $j_2$ represent, respectively, the first and second most probable classes predicted by the classifier.
Highest entropy. The $K$ examples with the highest entropy value are chosen; the larger the entropy, the more uncertain the example. The entropy of sample $x_i$, denoted $en_i$, is computed as $en_i = -\sum_{j=1}^{m} p(y_i = j \mid x_i; W) \log p(y_i = j \mid x_i; W)$.
In all three cases just described, $p(y_i = j \mid x_i; W)$ represents the probability, under the CNN with weights $W$, of $x_i$ belonging to the $j$-th class.
High-confidence samples are chosen as the samples whose entropy is smaller than the threshold $\delta$. Based on this, the pseudo-label of example $x_i$ is defined as $y_i = j^{*} \, H(\delta - en_i)$, where $j^{*} = \arg\max_j p(y_i = j \mid x_i; W)$ and $H$ is the Heaviside function: its value is 1 if the argument is greater than zero and zero otherwise.
Samples with high confidence are collected in a set denoted $D^H$.
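The high-confidence selection just described can be sketched as follows; the helper name, the toy probabilities and the use of -1 as a "no pseudo-label" marker are our illustrative choices.

```python
import numpy as np

def select_high_confidence(probs, delta):
    """Assign pseudo-labels to the samples whose prediction entropy
    falls below the threshold delta; all other samples get -1,
    meaning no pseudo-label is assigned."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    mask = entropy < delta                 # high-confidence samples
    pseudo_labels = probs.argmax(axis=1)   # most probable class
    return mask, np.where(mask, pseudo_labels, -1)

probs = np.array([[0.98, 0.01, 0.01],   # low entropy: pseudo-labeled
                  [0.50, 0.30, 0.20]])  # high entropy: left alone
mask, labels = select_high_confidence(probs, delta=0.5)
```

Only the first example clears the entropy threshold, so only it receives a pseudo-label (class 0).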
Fine-tune the CNN by updating the high-confidence threshold and the network weights via back-propagation. This is achieved by the following iteration. For $t = 1, \dots, T$:
Update the weights based on the learning rate and the gradient of the cost function given by
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} \mathbf{1}\{y_i = j\} \log p(y_i = j \mid x_i; W),$$
where $\mathbf{1}\{\cdot\}$ represents the indicator function and $N = |D^H \cup D^L|$. Notice this implies the outer summation is performed over the examples belonging to the union of high-confidence samples and labeled samples.
Once this update has been performed, remove the pseudo-labels from the samples in $D^H$ and put them back into $D^U$.
Update the high-confidence threshold as $\delta = \delta_0 - dr \cdot t$, where $\delta_0$ is the threshold initially chosen by the user.
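Putting the steps together, the loop can be sketched end to end as below. This is a structural sketch under our own naming: `train_fn`, `predict_fn` and `oracle_fn` are hypothetical stand-ins for fine-tuning the CNN, obtaining class probabilities and querying the human annotator; it is not the authors' implementation.

```python
import numpy as np

def entropy_of(probs):
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def ceal_loop(train_fn, predict_fn, oracle_fn, X_pool, T, K, delta0, dr):
    """Structural sketch of the CEAL iteration described above."""
    labeled, labels = [], []
    unlabeled = list(range(len(X_pool)))
    delta = delta0
    for t in range(T):
        probs = predict_fn(X_pool[unlabeled])
        # Step 1: send the K most uncertain samples to the oracle
        # (entropy criterion here; least confidence or margin also work).
        for q in np.argsort(entropy_of(probs))[::-1][:K]:
            labeled.append(unlabeled[int(q)])
            labels.append(oracle_fn(unlabeled[int(q)]))
        unlabeled = [i for i in unlabeled if i not in labeled]
        # Step 2: pseudo-label the high-confidence samples (entropy < delta).
        probs = predict_fn(X_pool[unlabeled])
        ent = entropy_of(probs)
        high = [i for i, e in zip(unlabeled, ent) if e < delta]
        pseudo = list(probs[ent < delta].argmax(axis=1))
        # Step 3: fine-tune on labeled plus pseudo-labeled samples; the
        # pseudo-labels are discarded afterwards (high stays unlabeled).
        train_fn(X_pool[labeled + high], np.array(labels + pseudo))
        # Step 4: decay the high-confidence threshold.
        delta = delta0 - dr * (t + 1)
    return labeled, labels
```

Note that pseudo-labeled samples never leave the unlabeled pool: they only participate in the fine-tuning step of the current iteration.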
Note that the steps spelled out above imply that CEAL has the following hyperparameters:
- the number of iterations $T$;
- the criterion used to choose informative samples;
- the number of informative samples $K$;
- the threshold $\delta_0$ for high-confidence samples;
- the threshold decay rate $dr$;
- the learning rate used to update the CNN weights.
The following picture displays, roughly speaking, the CEAL process described in this section.
C. State-of-the-art method compared against CEAL
CEAL is compared with baseline and state-of-the-art methods to provide evidence of its improvement over the existing leading method. The three reference methods are:
Active learning all (AL_ALL). All training samples are manually labeled and used to train the CNN. This approach naturally constitutes an upper bound for performance, as all training data points are labeled.
Active learning random (AL_RAND). Samples to be labeled are randomly selected to fine-tune the CNN. This approach constitutes a lower bound, as no active learning technique is incorporated.
Triple Criteria Active Learning (TCAL). A state-of-the-art active learning method that identifies an effective and as small as possible set of relevant and irrelevant examples based on three criteria: uncertainty and diversity, which select the most informative examples in the dataset, and density, which selects examples that are representative of the distribution captured in the dataset.
In order to compare CEAL with TCAL, the authors prepared a TCAL pipeline by building an SVM classifier and applying: uncertainty, via the margin sampling strategy; diversity, by clustering the most uncertain samples via k-means; and density, by computing the average distance between a reference data point and the other samples within the cluster the reference point belongs to. The point with the smallest average distance (largest density) is the most informative sample.
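The three TCAL steps can be sketched with NumPy alone as below; this is our illustrative reading of the pipeline, with the SVM abstracted away (`probs` stands for its class-probability outputs) and a minimal hand-rolled k-means instead of a library call.

```python
import numpy as np

def tcal_select(probs, X_pool, n_uncertain=10, n_clusters=2, seed=0):
    """Illustrative sketch of TCAL selection: margin uncertainty,
    k-means diversity, then density within each cluster."""
    # Uncertainty: margin between the two most probable classes.
    sp = np.sort(probs, axis=1)
    margin = sp[:, -1] - sp[:, -2]
    uncertain = np.argsort(margin)[:n_uncertain]
    U = X_pool[uncertain]
    # Diversity: a few iterations of plain k-means over the uncertain set.
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), n_clusters, replace=False)]
    for _ in range(10):
        assign = ((U[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(n_clusters):
            if (assign == c).any():
                centers[c] = U[assign == c].mean(0)
    # Density: within each cluster, pick the sample with the smallest
    # average distance to the other members (the densest point).
    picks = []
    for c in range(n_clusters):
        members = uncertain[assign == c]
        if len(members) == 0:
            continue
        M = X_pool[members]
        d = np.sqrt(((M[:, None] - M[None]) ** 2).sum(-1))
        picks.append(int(members[d.mean(1).argmin()]))
    return picks
```

One representative sample per non-empty cluster is returned, mirroring the "smallest average distance means largest density" rule from the text.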
In the following section we briefly discuss the experiments set up for this comparison, as well as their results.
D. Datasets, experiments and results
Comparison of CEAL with the three reference methods mentioned in Section III-C was performed on the Cross-Age Celebrity face recognition Dataset (CACD) and the Caltech-256 object categorization dataset. The former consists of over 160,000 images while the latter contains 30,607 images of 256 categories. The images in each dataset were resized, and values for the CEAL parameters ($T$, $K$, $\delta_0$, $dr$ and the learning rate) were chosen per dataset:
- for CACD, a single learning rate was used for all layers;
- for Caltech-256, the softmax layer was assigned its own learning rate, different from that of all other layers.
Also, from Section III-B we learned that informative samples can be selected based on three different criteria: least confidence, margin sampling and highest entropy. We denote each of these methods, respectively, CEAL_LC, CEAL_MS and CEAL_EN.
In this section we present the results of different experiments based on different combinations of the AL methods against which CEAL is compared (AL_RAND, AL_ALL or TCAL) and the CEAL selection criteria (CEAL_LC, CEAL_MS or CEAL_EN).
Also, a different experiment is proposed to assess the raw contribution of the least confidence (LC), margin sampling (MS) and highest entropy (EN) criteria. This experiment consists of disregarding the informative-sample selection criteria and incorporating only the cost-effective high-confidence sample selection step of the algorithm (step 2 of the algorithm of Section III-B) into the AL_RAND active learning strategy. This strategy is denoted by the authors as CEAL_RAND. That is, since AL_RAND randomly selects samples to be annotated, CEAL_RAND reflects the original contribution of the pseudo-labeled majority high-confidence sample strategy.
All plots below show the classification accuracy for different fractions of annotated samples.
CEAL_MS performance. On both datasets CEAL_MS is compared with three AL methods, AL_ALL, AL_RAND and TCAL.
For all fractions of labeled samples, the graphs suggest that CEAL with the margin sampling (MS) selection criterion, CEAL_MS, outperforms AL_RAND and TCAL.
From the numerical values from which these plots are generated, though, it is possible to see that at fractions between 0.7 and 0.8, CEAL_MS and TCAL are both fairly competitive with AL_ALL and very close to one another in performance. This suggests that the increase in performance offered by CEAL_MS over TCAL is marginal.
Least confidence CEAL and AL. CEAL_LC is compared with the AL_RAND and AL_ALL active learning methods on both datasets. Comparison is also done with the respective active learning method (in this case least confidence). CEAL_RAND is included as a reference. Panel on the left refers to CACD dataset while the one on the right is for Caltech-256.
Margin sampling CEAL and AL. Similar to the previous case, CEAL_MS is compared with AL_RAND and AL_ALL active methods on both datasets. Comparison is also done with the respective active learning method (in this case margin sampling). CEAL_RAND is once again included as a reference. Panel on the left refers to CACD dataset while the one on the right is for Caltech-256.
Highest entropy CEAL and AL. Similar to the previous case, CEAL_EN is compared with AL_RAND and AL_ALL active methods on both datasets. Comparison is also done with the respective active learning method (in this case highest entropy). CEAL_RAND is once again included as a reference. Panel on the left refers to CACD dataset while the one on the right is for Caltech-256.
In experiments 2, 3 and 4 we observe that CEAL, when combined with each selection criterion (least confidence, margin sampling and entropy), outperforms active learning combined with the corresponding criterion. This improvement, however, seems to be marginal when the fraction of labeled samples gets closer to a value between 0.7 and 0.8. This observation holds on both datasets and may suggest that most of the performance is already captured by active learning methods, which would imply that the selection criterion has little impact on performance improvement. This is somewhat confirmed by the comparison between CEAL_RAND and the active learning method combined with the corresponding selection criterion.
As a final experiment the authors propose combining CEAL_LC, CEAL_MS and CEAL_EN in the following way. On each iteration of the algorithm, the top samples are selected according to each selection criterion, resulting in three sets of data. These three sets are then combined and duplicates are removed. Lastly, random points are taken from the resulting union for annotation. This method is denoted by the authors as CEAL_FUSION.
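The fusion step just described can be sketched as follows; the helper name and the choice of $K$ points per criterion and per final draw are our reading of the description, not the authors' code.

```python
import numpy as np

def fusion_select(probs, K, rng=None):
    """Sketch of CEAL_FUSION selection: take the top-K samples under each
    criterion, union them without duplicates, then draw K at random."""
    if rng is None:
        rng = np.random.default_rng(0)
    sp = np.sort(probs, axis=1)
    lc = sp[:, -1]                                     # least confidence
    ms = sp[:, -1] - sp[:, -2]                         # margin sampling
    en = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # entropy
    top_lc = np.argsort(lc)[:K]          # lowest confidence values
    top_ms = np.argsort(ms)[:K]          # smallest margins
    top_en = np.argsort(en)[::-1][:K]    # largest entropies
    union = np.unique(np.concatenate([top_lc, top_ms, top_en]))
    return rng.choice(union, size=min(K, len(union)), replace=False)
```

Since the three criteria often agree on the most uncertain examples, the union is typically only slightly larger than each individual top-K set.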
The result of this experiment, in terms of accuracy versus fraction of labeled samples, is shown in the following figure.
The three different approaches, CEAL_LC, CEAL_MS and CEAL_EN, perform similarly for every fraction of labeled samples, while CEAL_FUSION outperforms each of them.
IV. Summary and general comments
The authors propose a way to incorporate active learning into deep learning by selecting informative samples based on three different criteria: least confidence, margin sampling and highest entropy. In addition, the authors propose a mechanism to incorporate high-confidence pseudo-labels.
The approach outperforms state-of-the-art active learning methods in classification tasks on the CACD and Caltech-256 datasets, and the ballpark performance is not dramatically sensitive to the selection criterion chosen. The improvement is, however, marginal.
One of the advantages of the method is that the number of informative samples needed to reach such performance is a small fraction of the whole dataset - less than 5% in both datasets explored.
The method depends on a number of user-defined parameters to update CNN weights. Although reported, it is not clear how the values chosen for the experiments were selected by the user. An interesting piece of future work would be a study of how sensitive the performance of the method is to the thresholds for informative samples and high confidence samples.
V. PyTorch implementation
[GitHub link]: https://github.com/rafikg/CEAL
All the details needed to use the code are explained in the repo.