I want to build a primary CNN-based classifier and a similar secondary classifier, both operating on image regions. The primary classifier should run on a primary region, while the secondary classifier should run on assistive regions and provide additional evidence to support the decision made by the primary classifier.
In other words, the primary region and the assistive regions are used together to infer a single class label at a time.
What other ways or architectures exist these days to perform such a task, instead of ROI Pooling?
Ideally, I would like a classifier scheme similar to the one in this paper, but without the use of ROI Pooling.
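To make the intended setup concrete, here is a minimal sketch of the scheme I have in mind (all module names, layer sizes, and the fusion step are just placeholders of my own, not taken from the paper):

```python
import torch.nn as nn

class RegionPairClassifier(nn.Module):
    """Hypothetical sketch: a primary and a secondary CNN branch whose
    evidence is fused into a single class prediction."""
    def __init__(self, num_classes):
        super().__init__()
        def make_branch():
            # Tiny CNN feature extractor used by each branch.
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.primary_branch = make_branch()
        self.secondary_branch = make_branch()
        self.primary_head = nn.Linear(32, num_classes)
        self.secondary_head = nn.Linear(32, num_classes)

    def forward(self, primary_crop, assistive_crops):
        # primary_crop: (1, 3, H, W); assistive_crops: (N, 3, H, W)
        primary_logits = self.primary_head(self.primary_branch(primary_crop))
        secondary_logits = self.secondary_head(self.secondary_branch(assistive_crops))
        # Placeholder fusion: primary evidence plus the strongest assistive evidence.
        return primary_logits + secondary_logits.max(dim=0, keepdim=True).values
```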
2 Answers
Answer 1
Yaw Lin's answer contains a good amount of information; I'll just build on his last paragraph. I think the essence of what you want to do is not so much to process the person and the background independently and then compare the results (which is what you described), but rather to process the background first and infer from it what to expect in the primary region. Once you have those expectations, you can compare the primary region against the most significant ones.
For example, from Figure 1 (b) in your Arxiv link, if you can process the background and determine that it is outdoors in a highly populated area, then you can concentrate much of the probability mass for what the person is doing on social outdoor activities, making jogging a much more likely guess before you even process the figure you're interested in. In contrast, for Figure 1 (a), if you can process the background and tell that it is indoors and contains computers, then you can concentrate probability on solitary indoor computer-based activities, skyrocketing the probability of "working on a computer".
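One way to realize this idea (a rough sketch of my own, not something from the paper; the function names and the Bayesian-style fusion are assumptions) is to treat the background prediction as a prior over activity labels and combine it with the primary-region evidence:

```python
import torch
import torch.nn.functional as F

def fuse_with_context_prior(primary_logits, background_logits, prior_weight=1.0):
    """Hypothetical fusion: the background classifier's output acts as a prior
    over activity labels and is combined with the primary-region evidence.
    Both inputs are unnormalized logits of shape (num_classes,)."""
    log_prior = F.log_softmax(background_logits, dim=-1)
    log_likelihood = F.log_softmax(primary_logits, dim=-1)
    # Log-space product of prior and likelihood (unnormalized posterior).
    log_posterior = log_likelihood + prior_weight * log_prior
    return torch.argmax(log_posterior).item()

# Example: a background suggesting "outdoor, crowded" shifts the decision
# toward "jogging" even if the primary region alone is ambiguous.
```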
Answer 2
You can take a look at https://arxiv.org/pdf/1611.10012.pdf, which contains a comprehensive survey of recent detection architectures. Basically, there are three meta-architectures, and all models fall into one of these categories:
- Faster R-CNN: Similar to the paper you referenced, this is the improved version of Fast R-CNN that no longer uses selective search and instead integrates proposal generation directly into the network via a region proposal network (RPN).
- R-FCN: Similar in architecture to 1, except that ROI pooling is performed differently, using position-sensitive ROI pooling.
- SSD: Modifies the RPN in Faster R-CNN to directly output class probabilities, eliminating the per-ROI computation done in ROI pooling. This is the fastest architecture type; YOLO also falls into this category (see the sketch after this list for the general idea of such a head).
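As a rough illustration of the type-3 idea (a generic sketch, not the exact SSD or YOLO head; the layer sizes and names are assumptions), a convolutional head can emit class logits densely over the feature map, so no per-ROI pooling step is needed:

```python
import torch.nn as nn

class DenseClassHead(nn.Module):
    """Generic SSD/YOLO-style head sketch: one conv layer maps a backbone
    feature map directly to per-location, per-anchor class logits,
    avoiding any per-ROI pooling."""
    def __init__(self, in_channels=256, num_anchors=3, num_classes=10):
        super().__init__()
        self.num_anchors = num_anchors
        self.num_classes = num_classes
        self.cls_conv = nn.Conv2d(in_channels, num_anchors * num_classes,
                                  kernel_size=3, padding=1)

    def forward(self, feature_map):
        # feature_map: (B, C, H, W) -> logits: (B, anchors*classes, H, W)
        logits = self.cls_conv(feature_map)
        b, _, h, w = logits.shape
        # Reshape to (B, H, W, anchors, classes) for per-location class scores.
        return logits.view(b, self.num_anchors, self.num_classes, h, w).permute(0, 3, 4, 1, 2)
```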
Based on my rough read-through of the paper you referenced, I think type 3 is the one you are looking for. However, in terms of implementation, equation 3 can be a little tricky: since this architecture type computes probabilities over the whole image, you may need to stop backpropagating gradients to regions that do not overlap with the primary region (or at least think about how they could affect the final result).
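A minimal sketch of that gradient-stopping idea (my own workaround, not something prescribed by the paper; the overlap mask is assumed to be precomputed from the primary region's box):

```python
import torch

def mask_non_overlapping_scores(score_map: torch.Tensor, overlap_mask: torch.Tensor):
    """score_map: (num_classes, H, W) dense class scores over the image.
    overlap_mask: (H, W) boolean, True where a cell overlaps the primary region.
    Cells outside the primary region still contribute their current values to
    the forward pass, but no gradient flows back into them."""
    mask = overlap_mask.unsqueeze(0).to(score_map.dtype)  # (1, H, W)
    return score_map * mask + (score_map * (1.0 - mask)).detach()
```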
I also note that there are in fact no primary/secondary "classifiers". The paper describes primary/secondary "regions": the primary region is the one that contains the person (i.e., a person detector is used to find the primary region first), and the secondary regions are those that overlap with the primary region. For activity classification there is only one classifier, except that the primary region carries more weight and the secondary regions each contribute a little to the final prediction score.
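To make that weighting explicit, here is a minimal sketch of a single shared classifier scored over the primary and secondary regions (the shared-head setup, the weight, and the averaging over secondary regions are my assumptions; the paper's exact aggregation may differ):

```python
import torch.nn as nn

def combined_action_score(shared_head, primary_feat, secondary_feats, secondary_weight=0.3):
    """One shared classifier head is applied to all regions.
    primary_feat: (D,) feature of the person region.
    secondary_feats: (N, D) features of regions overlapping the primary one.
    The primary region dominates; secondary regions add weaker supporting evidence."""
    primary_score = shared_head(primary_feat.unsqueeze(0))   # (1, num_classes)
    secondary_scores = shared_head(secondary_feats)          # (N, num_classes)
    # Each secondary region contributes a little via a down-weighted average.
    support = secondary_scores.mean(dim=0, keepdim=True)
    return primary_score + secondary_weight * support

# Usage (hypothetical): shared_head = nn.Linear(2048, num_actions);
# the predicted label is combined_action_score(...).argmax(dim=1).
```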