In this paper, we propose a method to improve image classification performance by fusing CNN and transformer structures. A CNN extracts information about local areas of an image well, but its ability to extract global information is limited. A transformer, on the other hand, has an advantage in global extraction, but it requires much more memory than a CNN. We apply a CNN to the image and treat the feature vector of each pixel on the resulting feature map as a token. At the same time...