
Journal/Proceedings Details


Computer vision and image understanding : CVIU (13 items)

  1. [International article]   Inside Front Cover - Editorial Board Page/Cover image legend if applicable


    Computer vision and image understanding : CVIU, v.163, pp. IFC, 2017, ISSN 1077-3142

  2. [International article]   Inside Front Cover - Editorial Board Page/Cover image legend if applicable   (SCI, SCIE, SCOPUS)


    Computer vision and image understanding : CVIU, v.163, pp. IFC, 2017, ISSN 1077-3142

  3. [International article]   Guest Editorial: Language in Vision   (SCI, SCIE, SCOPUS)

    Yan, Yan (Department of Computer Science, Texas State University, USA), Lu, Jiwen (Department of Automation, Tsinghua University, China), Mian, Ajmal (Department of Computer Science and Software Engineering, University of Western Australia, Australia), Ross, Arun (Department of Computer Science and Engineering, Michigan State University, USA), Murino, Vittorio (Pattern Analysis and Computer Vision, Istituto Italiano di Tecnologia, Italy), Horaud, Radu (INRIA Grenoble Rhone-Alpes, France)
    Computer vision and image understanding : CVIU, v.163, pp. 1-2, 2017, ISSN 1077-3142

  4. [International article]   Visual question answering: Datasets, algorithms, and future challenges   (SCI, SCIE, SCOPUS)

    Kafle, Kushal (Corresponding author), Kanan, Christopher
    Computer vision and image understanding : CVIU, v.163, pp. 3-20, 2017, ISSN 1077-3142

    Abstract
    Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images. Since the release of the first VQA dataset in 2014, additional datasets have been released and many algorithms have been proposed. In this review, we critically examine the current state of VQA in terms of problem formulation, existing datasets, evaluation metrics, and algorithms. In particular, we discuss the limitations of current datasets with regard to their ability to properly train and assess VQA algorithms. We then exhaustively review existing algorithms for VQA. Finally, we discuss possible future directions for VQA and image understanding research.

    Highlights
    - Comparison of visual question answering (VQA) with related computer vision tasks.
    - Critical review of all major VQA datasets and evaluation metrics.
    - Comprehensive review and comparison of existing methods for VQA.
    - All major datasets have language and difficulty bias that critically affects VQA.
    - Recommendations for future VQA datasets and evaluation metrics to combat bias.
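    The evaluation metrics this survey discusses center on the consensus accuracy introduced with the original VQA dataset. The sketch below illustrates that metric under stated assumptions: the function name is ours, and the official protocol additionally averages the score over subsets of the ten human annotators.

```python
from collections import Counter

def vqa_consensus_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Consensus accuracy commonly used with the original VQA dataset:
    an answer scores min(#humans who gave it / 3, 1), so agreeing with at
    least three of the ten annotators counts as fully correct."""
    counts = Counter(a.strip().lower() for a in human_answers)
    return min(counts[predicted.strip().lower()] / 3.0, 1.0)

# Example: ten human annotators for one question.
answers = ["red"] * 8 + ["maroon"] * 2
print(vqa_consensus_accuracy("red", answers))     # 1.0   (>= 3 humans agree)
print(vqa_consensus_accuracy("maroon", answers))  # ~0.67 (partial credit)
```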

  5. [International article]   Visual question answering: A survey of methods and datasets   (SCI, SCIE, SCOPUS)

    Wu, Qi (Corresponding author), Teney, Damien, Wang, Peng, Shen, Chunhua, Dick, Anthony, van den Hengel, Anton
    Computer vision and image understanding : CVIU, v.163, pp. 21-40, 2017, ISSN 1077-3142

    Abstract
    Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models.

    Highlights
    - A comprehensive review of the state of the art on the emerging task of visual question answering.
    - Review of the growing number of datasets, highlighting their distinct characteristics.
    - An in-depth analysis of the questions/answers provided in the recently-released Visual Genome dataset.
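    The "CNN + RNN mapped to a common feature space" pattern the survey highlights can be sketched minimally as below. The layer sizes, element-wise-product fusion, and class name are illustrative assumptions, not the architecture of any specific surveyed model.

```python
import torch
import torch.nn as nn

class MinimalVQA(nn.Module):
    """Toy CNN+LSTM VQA baseline: encode image and question separately,
    fuse by element-wise product, classify over a fixed answer vocabulary."""
    def __init__(self, vocab_size=1000, embed_dim=300, hidden_dim=512, num_answers=1000):
        super().__init__()
        # Tiny stand-in CNN; real systems typically use a pretrained backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image, question_tokens):
        img_feat = self.cnn(image)                       # (B, hidden_dim)
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q_feat = h_n[-1]                                 # (B, hidden_dim)
        fused = img_feat * q_feat                        # common feature space
        return self.classifier(fused)                    # answer scores

model = MinimalVQA()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 1000])
```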

  6. [International article]   Vision-language integration using constrained local semantic features   (SCI, SCIE, SCOPUS)

    Tamaazousti, Youssef (Corresponding author at: CEA, LIST, Laboratory of Vision and Content Engineering, F-91191 Gif-sur-Yvette, France), Le Borgne, Hervé (CEA, LIST, Laboratory of Vision and Content Engineering, F-91191 Gif-sur-Yvette, France), Popescu, Adrian (CEA, LIST, Laboratory of Vision and Content Engineering, F-91191 Gif-sur-Yvette, France), Gadeski, Etienne (CEA, LIST, Laboratory of Vision and Content Engineering, F-91191 Gif-sur-Yvette, France), Ginsca, Alexandru (CEA, LIST, Laboratory of Vision and Content Engineering, F-91191 Gif-sur-Yvette, France), Hudelot, Céline (University of Paris-Saclay, Mathematics in Interaction with Computer Science (MICS), 92295 Châtenay-Malabry, France)
    Computer vision and image understanding : CVIU, v.163, pp. 41-57, 2017, ISSN 1077-3142

    Abstract
    This paper tackles two recent promising issues in the field of computer vision, namely “the integration of linguistic and visual information” and “the use of semantic features to represent the image content”. Semantic features represent images according to some visual concepts that are detected in the image by a set of base classifiers. Recent works exhibit competitive performances in image classification and retrieval using such features. We propose to rely on this type of image description to facilitate its integration with linguistic data. More precisely, the contribution of this paper is threefold. First, we propose to automatically determine the most useful dimensions of a semantic representation according to the actual image content. Hence, the level of sparsity of the semantic features is adapted to each image independently. Our model takes into account both the confidence on each base classifier and the global amount of information of the semantic signature, defined in the Shannon sense. This contribution is further extended to better reflect the detection of a visual concept at a local scale. Second, we introduce a new strategy to learn an efficient mid-level representation by CNNs that boosts the performance of semantic signatures. Last, we propose several schemes to integrate a visual representation based on semantic features with some linguistic piece of information, leading to the nesting of linguistic information at two levels of the visual features. Experimental validation is conducted on four benchmarks (VOC 2007, VOC 2012, Nus-Wide and MIT Indoor) for classification, three of them for retrieval and two of them for bi-modal classification. The proposed semantic feature achieves state-of-the-art performances on three classification benchmarks and all retrieval ones. Regarding our vision-language integration method, it achieves state-of-the-art performances in bi-modal classification.

    Highlights
    - Vision and language integration at two levels, including the semantic level.
    - A semantic signature that adapts its sparsity to the actual visual content of images.
    - CNN-based mid-level features boosting semantic signatures.
    - Top performances on publicly available benchmarks for several tasks.

    Graphical abstract: [display omitted]
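    One loose reading of the adaptive-sparsity idea (keep only the semantic dimensions supported by the image content, judged via classifier confidence and Shannon information) is sketched below for illustration. Using the perplexity of the score distribution as the per-image dimension budget is our assumption, not the paper's exact selection rule.

```python
import numpy as np

def adaptive_semantic_signature(scores: np.ndarray) -> np.ndarray:
    """Illustrative image-adaptive sparse semantic signature: normalise
    non-negative per-concept classifier scores, use the Shannon entropy of
    the distribution to decide how many concepts to keep (its perplexity),
    and zero out the rest."""
    p = scores / scores.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    k = max(1, int(np.ceil(np.exp(entropy))))        # effective number of concepts
    keep = np.argsort(scores)[::-1][:k]
    sparse = np.zeros_like(scores)
    sparse[keep] = scores[keep]
    return sparse

# A confident detection (one dominant concept) keeps few dimensions;
# a flat score vector keeps many.
print(np.count_nonzero(adaptive_semantic_signature(np.array([0.9, 0.05, 0.03, 0.02]))))  # 2
print(np.count_nonzero(adaptive_semantic_signature(np.full(4, 0.25))))                   # 4
```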

  7. [International article]   Recognizing semantic correlation in image-text weibo via feature space mapping   (SCI, SCIE, SCOPUS)

    Liu, Maofu (College of Computer Science and Technology, Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Wuhan 430065, China), Zhang, Luming (School of Computer and Information, Hefei University of Technology, Hefei 230009, China), Liu, Ya (College of Computer Science and Technology, Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Wuhan 430065, China), Hu, Huijun (Corresponding author), Fang, Wei (Jiangsu Engineering Center of Network Monitoring, Nanjing University of Information Science and Technology, Nanjing 210044, China)
    Computer vision and image understanding : CVIU, v.163, pp. 58-66, 2017, ISSN 1077-3142

    Abstract
    Recent years have witnessed the fast development of social media platforms such as Twitter, Sina Weibo, and WeChat. In practice, textual weibos are frequently uploaded with images; we call these image-text weibos in this paper. To gain deeper insight into the semantics of image-text weibos, this paper explores the semantic correlation between the image and the text. Because of the heterogeneity and incomparability of the image, text, and social multi-source information in image-text weibos, we develop a semantic correlation recognition approach based on feature space mapping and a support vector machine. Our model first extracts three types of features, namely textual-linguistic, visual, and social features. It then uses a genetic algorithm to project the features from their different feature spaces into a unified one. Finally, a semantic correlation recognition model based on a support vector machine is constructed in the unified feature space. The experimental results show that our recognition model for semantic correlation between image and text in image-text weibos, using feature space mapping and a support vector machine with the three types of multi-source features, achieves significantly better accuracy than the traditional model based only on a support vector machine.

    Highlights
    - We address the recognition of semantic correlation between image and text in image-text weibos.
    - We make an in-depth study of weibo text from language-cognitive and computational viewpoints and extract textual-linguistic features from the weibo text.
    - We select the visual feature space as the unified one, and the genetic mapping algorithm transforms textual-linguistic and social features into the unified feature space to eliminate the heterogeneity and incomparability of the different types of features.
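    A toy sketch of the pipeline described above: project textual-linguistic and social features into the visual feature space with a genetic search, then classify with an SVM. The synthetic data, matrix shapes, and tiny mutation-only GA are illustrative stand-ins, far simpler than the paper's genetic mapping algorithm.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for the three feature types described in the abstract.
n = 200
visual  = rng.normal(size=(n, 16))            # unified (visual) feature space
textual = rng.normal(size=(n, 8))             # textual-linguistic features
social  = rng.normal(size=(n, 4))             # social features
labels  = (visual[:, 0] + textual[:, 0] > 0).astype(int)   # correlated / not

def fitness(W_t, W_s):
    """Fitness of a candidate mapping: SVM cross-validation accuracy after
    projecting textual and social features into the visual feature space."""
    unified = np.hstack([visual, textual @ W_t, social @ W_s])
    return cross_val_score(SVC(kernel="rbf"), unified, labels, cv=3).mean()

# Minimal genetic search over the two projection matrices (selection + mutation).
pop = [(rng.normal(size=(8, 16)), rng.normal(size=(4, 16))) for _ in range(10)]
for generation in range(5):
    scored = sorted(pop, key=lambda ws: fitness(*ws), reverse=True)
    parents = scored[:4]                          # selection: keep the fittest
    children = [(wt + 0.1 * rng.normal(size=wt.shape),
                 ws + 0.1 * rng.normal(size=ws.shape))
                for wt, ws in parents]            # mutation
    pop = parents + children + [scored[0]] * 2    # elitism pads the population

best = max(pop, key=lambda ws: fitness(*ws))
print("best cross-validated accuracy:", round(fitness(*best), 3))
```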

  8. [International article]   Simple to complex cross-modal learning to rank   (SCI, SCIE, SCOPUS)

    Luo, Minnan (SPKLSTN Lab, Department of Computer Science, Xi'an Jiaotong University, Xi'an, China), Chang, Xiaojun (Corresponding author), Li, Zhihui (Faculty of Engineering and Information Technology, University of Technology Sydney, Australia), Nie, Liqiang (School of Computing, National University of Singapore, Singapore), Hauptmann, Alexander G. (School of Computer Science, Carnegie Mellon University, PA, USA), Zheng, Qinghua (SPKLSTN Lab, Department of Computer Science, Xi'an Jiaotong University, Xi'an, China)
    Computer vision and image understanding : CVIU, v.163, pp. 67-77, 2017, ISSN 1077-3142

    Abstract
    The heterogeneity gap between different modalities brings a significant challenge to multimedia information retrieval. Some studies formalize cross-modal retrieval tasks as a ranking problem and learn a shared multi-modal embedding space to measure cross-modality similarity. However, previous methods often establish the shared embedding space with linear mapping functions, which might not be sophisticated enough to reveal more complicated inter-modal correspondences. Additionally, current studies assume that all rankings are of equal importance, and thus either use all rankings simultaneously or select a small number of rankings at random to train the embedding space at each iteration. Such strategies, however, suffer from outliers as well as reduced generalization capability, because they do not reflect the gradual procedure of human cognition. In this paper, we incorporate self-paced learning with diversity into cross-modal learning to rank and learn an optimal multi-modal embedding space based on non-linear mapping functions. This strategy enhances the model’s robustness to outliers and achieves better generalization by training the model gradually, from easy rankings given by diverse queries to more complex ones. An efficient alternative algorithm is exploited to solve the proposed challenging problem with fast convergence in practice. Extensive experimental results on several benchmark datasets indicate that the proposed method achieves significant improvements over the state of the art in this literature.

    Highlights
    - We learn a multi-modal embedding space gradually, from easy to more complex rankings.
    - We employ non-linear mapping functions to establish the multi-modal embedding space for more sophisticated cross-modal correspondence.
    - An efficient alternative algorithm is exploited to solve the proposed challenging problem with fast convergence in practice.
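    The self-paced-with-diversity schedule described above (train on easy samples first, drawn from diverse query groups, then grow the budget) can be illustrated with a toy regression stand-in. The model, loss, and per-group selection below are our simplifications; they are not the paper's ranking objective or its non-linear mappings.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# Synthetic stand-in: map "image" features to "text" features (a toy proxy
# for a shared embedding); samples are grouped by query.
n, groups = 300, 6
X = rng.normal(size=(n, 20))
Y = X @ rng.normal(size=(20, 10)) + 0.1 * rng.normal(size=(n, 10))
Y[:30] += 3.0                                    # a block of outliers
group_id = rng.integers(0, groups, size=n)

model = Ridge(alpha=1.0)
selected = np.zeros(n, dtype=bool)

# Self-paced curriculum with diversity: start from the easiest samples,
# pick them per group so no single query dominates, and grow the budget.
for step, per_group_budget in enumerate([10, 20, 30, 40]):
    fit_idx = selected if selected.any() else np.ones(n, dtype=bool)
    model.fit(X[fit_idx], Y[fit_idx])
    loss = ((model.predict(X) - Y) ** 2).mean(axis=1)    # per-sample loss
    selected[:] = False
    for g in range(groups):                      # diversity: select within groups
        members = np.where(group_id == g)[0]
        easiest = members[np.argsort(loss[members])[:per_group_budget]]
        selected[easiest] = True
    print(f"step {step}: training on {selected.sum()} easy samples, "
          f"mean selected loss {loss[selected].mean():.3f}")
```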

  9. [International article]   Weakly supervised learning of actions from transcripts   (SCI, SCIE, SCOPUS)

    Kuehne, Hilde (Corresponding author), Richard, Alexander, Gall, Juergen
    Computer vision and image understanding : CVIU, v.163, pp. 78-89, 2017, ISSN 1077-3142

    Abstract
    We present an approach for weakly supervised learning of human actions from video transcriptions. Our system is based on the idea that, given a sequence of input data and a transcript, i.e. a list of the actions in the order they occur in the video, it is possible to infer the actions within the video stream and to learn the related action models without any frame-based annotation. Starting from the transcript information at hand, we split the given data sequences uniformly based on the number of expected actions. We then learn action models for each class by maximizing the probability that the training video sequences are generated by the action models, given the sequence order defined by the transcripts. The learned model can be used to temporally segment an unseen video with or without a transcript. Additionally, the inferred segments can be used as a starting point to train high-level fully supervised models. We evaluate our approach on four distinct activity datasets, namely Hollywood Extended, MPII Cooking, Breakfast, and CRIM13. The results show that the proposed system is able to align the scripted actions with the video data, that the learned models localize and classify actions in the datasets, and that they outperform any current state-of-the-art approach for aligning transcripts with video data.

    Highlights
    - An HMM-based system for weakly supervised action segmentation is proposed.
    - Only a sequence of occurring actions is needed; no framewise annotation is required.
    - CNN features can easily be included in the framework.
    - Our model shows state-of-the-art performance on various datasets.
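    A compact sketch of the alignment loop outlined above: initialize by splitting the video uniformly according to the transcript, fit simple per-action models, and realign by dynamic programming over ordered segment boundaries. The 1-D Gaussian action models and toy data are our assumptions for illustration; the paper itself uses HMMs over real video features.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy video: three actions in transcript order with unequal true durations,
# each frame described by a 1-D feature drawn around a per-action mean.
true_lengths, true_means = [60, 20, 40], [0.0, 3.0, -2.0]
frames = np.concatenate([m + 0.5 * rng.normal(size=n)
                         for n, m in zip(true_lengths, true_means)])
transcript = [0, 1, 2]                  # ordered list of actions, no frame labels
T, K = len(frames), len(transcript)

# Step 1: initialize by splitting the video uniformly across the transcript.
bounds = np.linspace(0, T, K + 1).astype(int)
labels = np.repeat(np.arange(K), np.diff(bounds))

for iteration in range(3):
    # Step 2: fit a simple per-action model (1-D Gaussian) from current labels.
    mu = np.array([frames[labels == k].mean() for k in range(K)])
    sd = np.array([frames[labels == k].std() + 1e-3 for k in range(K)])
    scores = -0.5 * ((frames[:, None] - mu) / sd) ** 2 - np.log(sd)   # (T, K)
    cum = np.vstack([np.zeros((1, K)), np.cumsum(scores, axis=0)])    # prefix sums

    # Step 3: realign by dynamic programming over ordered segment boundaries:
    # dp[t, k] = best score of explaining frames [0, t) with the first k actions.
    dp = np.full((T + 1, K + 1), -np.inf)
    back = np.zeros((T + 1, K + 1), dtype=int)
    dp[0, 0] = 0.0
    for t in range(1, T + 1):
        for k in range(1, K + 1):
            for s in range(k - 1, t):           # transcript entry k covers [s, t)
                cand = dp[s, k - 1] + cum[t, k - 1] - cum[s, k - 1]
                if cand > dp[t, k]:
                    dp[t, k], back[t, k] = cand, s
    t = T
    for k in range(K, 0, -1):                   # backtrace the segment boundaries
        s = back[t, k]
        labels[s:t] = k - 1
        t = s

print("recovered boundaries:", [int(np.argmax(labels == k)) for k in range(1, K)])
print("true boundaries:     ", list(np.cumsum(true_lengths)[:-1]))
```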

  10. [International article]   Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?   (SCI, SCIE, SCOPUS)

    Das, Abhishek (Corresponding author), Agrawal, Harsh (Virginia Tech, Blacksburg, VA, USA), Zitnick, Larry (Facebook AI Research, Menlo Park, CA, USA), Parikh, Devi (Georgia Institute of Technology, Atlanta, GA, USA), Batra, Dhruv (Georgia Institute of Technology, Atlanta, GA, USA)
    Computer vision and image understanding : CVIU, v.163, pp. 90-100, 2017, ISSN 1077-3142

    Abstract
    We conduct large-scale studies on ‘human attention’ in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans. Finally, we train VQA models with explicit attention supervision, and find that it improves VQA performance.

    Highlights
    - Multiple game-inspired novel interfaces for collecting human attention maps of where humans choose to look to answer questions from the large-scale VQA dataset.
    - Qualitative and quantitative comparison, through visualizations and rank-order correlation, of the maps generated by state-of-the-art attention-based VQA models and a task-independent saliency baseline against our human attention maps.
    - A VQA model trained with explicit supervision for attention, using our human attention maps as ground truth.
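    The quantitative comparison via rank-order correlation can be illustrated with a small helper that pools two attention maps onto a common coarse grid and computes the Spearman correlation of the pooled values; the grid size and mean pooling are our assumptions, not the paper's exact protocol.

```python
import numpy as np
from scipy.stats import spearmanr

def attention_rank_correlation(human_map: np.ndarray, model_map: np.ndarray,
                               grid: int = 14) -> float:
    """Compare two spatial attention maps by rank-order (Spearman) correlation
    after mean-pooling both onto a common grid."""
    def pool(m):
        h, w = m.shape
        ys = np.linspace(0, h, grid + 1).astype(int)
        xs = np.linspace(0, w, grid + 1).astype(int)
        return np.array([[m[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
                          for j in range(grid)] for i in range(grid)])
    rho, _ = spearmanr(pool(human_map).ravel(), pool(model_map).ravel())
    return rho

# Toy example: two maps that highlight the same image region correlate highly.
rng = np.random.default_rng(3)
human = rng.random((224, 224)); human[60:120, 80:160] += 2.0
model = rng.random((224, 224)); model[60:120, 80:160] += 1.5
print(round(attention_rank_correlation(human, model), 3))
```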

