IEEE Transactions on Multimedia

Archived papers: 596
Hierarchical Attention Network for Visually-Aware Food Recommendation
Xiaoyan Gao, Fuli Feng, Xiangnan He, Heyan Huang, Xinyu Guan, Chong Feng, Zhaoyan Ming, Tat-Seng Chua
Keywords: Visualization; Recommender systems; Collaboration; Encoding; Task analysis; Feature extraction; History; Food Recommender Systems; Hierarchical Attention; Collaborative Filtering; Ingredients; Recipe Image
Abstract: Food recommender systems play an important role in assisting users to identify the desired food to eat. Deciding what food to eat is a complex and multi-faceted process, which is influenced by many factors such as the ingredients, the appearance of the recipe, the user's personal preference on food, and various contexts like what has been eaten in past meals. This work formulates the food recommendation problem as predicting user preference on recipes based on three key factors that determine a user's choice on food, namely, 1) the user's (and other users') history; 2) the ingredients of a recipe; and 3) the descriptive image of a recipe. To address this challenging problem, this work develops a dedicated neural network-based solution, Hierarchical Attention based Food Recommendation (HAFR), which is capable of: 1) capturing the collaborative filtering effect like what similar users tend to eat; 2) inferring a user's preference at the ingredient level; and 3) learning user preference from the recipe's visual images. To evaluate our proposed method, this work constructs a large-scale dataset consisting of millions of ratings from AllRecipes.com. Extensive experiments show that our method outperforms several competing recommender solutions like Factorization Machine and Visual Bayesian Personalized Ranking with an average improvement of 12%, offering promising results in predicting user preference on food.
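To make the ingredient-level attention idea concrete, the following is a minimal, hypothetical PyTorch sketch of attention weights over ingredient embeddings conditioned on a user embedding. It is not the authors' HAFR implementation; the embedding sizes, scoring layer, and random inputs are illustrative assumptions.

```python
# Hypothetical sketch: user-conditioned attention pooling over ingredient embeddings.
import torch
import torch.nn as nn

class IngredientAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)   # scores one (user, ingredient) pair

    def forward(self, user_emb, ingredient_embs):
        # user_emb: (batch, dim); ingredient_embs: (batch, n_ingredients, dim)
        n = ingredient_embs.size(1)
        user_tiled = user_emb.unsqueeze(1).expand(-1, n, -1)
        logits = self.score(torch.cat([user_tiled, ingredient_embs], dim=-1)).squeeze(-1)
        weights = torch.softmax(logits, dim=-1)               # attention over ingredients
        return torch.einsum("bn,bnd->bd", weights, ingredient_embs)

user = torch.randn(8, 64)                    # toy user embeddings
ingredients = torch.randn(8, 12, 64)         # 12 ingredient embeddings per recipe
recipe_repr = IngredientAttention()(user, ingredients)
print(recipe_repr.shape)                     # torch.Size([8, 64])
```

In a full model, a representation like this would be fused with image features and history signals before scoring the recipe.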
Visual-Texual Emotion Analysis With Deep Coupled Video and Danmu Neural Networks
Chenchen Li, Jialin Wang, Hongwei Wang, Miao Zhao, Wenjie Li, Xiaotie Deng
Keywords: Danmu; deep multimodal learning; emotion analysis
Abstract: User emotion analysis toward videos aims to automatically recognize the general emotional status of viewers from the multimedia content embedded in the online video stream. Existing works fall into two categories: 1) visual-based methods, which focus on visual content and extract a specific set of features of videos. However, it is generally hard to learn a mapping function from low-level video pixels to high-level emotion space due to great intra-class variance. 2) textual-based methods, which focus on the investigation of user-generated comments associated with videos. The word representations learned by traditional linguistic approaches typically lack emotion information, and the global comments usually reflect viewers' high-level understandings rather than instantaneous emotions. To address these limitations, in this paper, we propose to jointly exploit video content and user-generated texts for emotion analysis. In particular, we exploit a new type of user-generated text, i.e., "danmu," real-time comments that float over the video and carry rich information about viewers' emotional opinions. To enhance the emotion discriminativeness of words in textual feature extraction, we propose Emotional Word Embedding (EWE) to learn text representations by jointly considering their semantics and emotions. Afterward, we propose a novel visual-textual emotion analysis model with Deep Coupled Video and Danmu Neural networks (DCVDN), in which visual and textual features are synchronously extracted and fused to form a comprehensive representation by deep-canonically-correlated-autoencoder-based multi-view learning. Through extensive experiments on a self-crawled real-world video-danmu dataset, we show that DCVDN significantly outperforms the state-of-the-art baselines.
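As a rough illustration of learning word vectors that carry both semantic and emotion signals, here is a minimal, hypothetical sketch that trains an embedding with a context-prediction objective and an emotion-label objective jointly. It is not the paper's EWE implementation; the toy data, vocabulary size, and loss weighting are assumptions.

```python
# Hypothetical sketch: word embeddings trained on a semantic plus an emotion objective.
import torch
import torch.nn as nn

vocab, emb_dim, n_emotions = 1000, 50, 6
embed = nn.Embedding(vocab, emb_dim)
context_head = nn.Linear(emb_dim, vocab)       # predicts a context word (semantics)
emotion_head = nn.Linear(emb_dim, n_emotions)  # predicts the comment's emotion label
opt = torch.optim.Adam(list(embed.parameters())
                       + list(context_head.parameters())
                       + list(emotion_head.parameters()), lr=1e-3)

words = torch.randint(0, vocab, (32,))          # toy center words
contexts = torch.randint(0, vocab, (32,))       # toy context words
emotions = torch.randint(0, n_emotions, (32,))  # toy emotion labels for the comments
for _ in range(10):
    v = embed(words)
    loss = (nn.functional.cross_entropy(context_head(v), contexts)
            + 0.5 * nn.functional.cross_entropy(emotion_head(v), emotions))
    opt.zero_grad()
    loss.backward()
    opt.step()
```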
Asymmetric Joint GANs for Normalizing Face Illumination From a Single Image
Xianjun Han, Hongyu Yang, Guanyu Xing, Yanli Liu
Keywords: Lighting; Face; Gallium nitride; Face recognition; Task analysis; Databases; Three-dimensional displays; face illumination normalization; face relighting; generative adversarial networks; face identity; image translation
Abstract: Illumination normalization for face recognition is very important when a face is captured under harsh lighting conditions. Instead of designing hand-crafted features, in this paper we formulate face illumination normalization as an image-to-image translation task. A great challenge of face normalization is that human facial structures are particularly sensitive to image structure distortion, which frequently occurs in traditional image-to-image translation tasks. Unfortunately, even slight facial structure distortions may prevent human eyes and machine face recognition methods from identifying face identities. To address this issue, a novel GAN-based network architecture called the asymmetric joint generative adversarial network (AJGAN) is developed to normalize face images under arbitrary illumination conditions, without known face geometry and albedo information. To maintain personalized facial structures, AJGAN incorporates an illumination normalization GAN $G_1$ and an asymmetric relighting GAN $G_2$, which maps a frontal-illuminated image to images with various lighting conditions. To avoid image blurring caused by the under-constrained relighting mapping, we introduce a scheme of one-hot lighting labels into $G_2$ and enforce a label classification loss. Furthermore, starting from a very limited number of labels, the number of training images is dynamically extended by combining different lighting labels. Qualitative and quantitative experiments on three databases validate that AJGAN significantly outperforms the state-of-the-art methods.
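To illustrate conditioning a relighting generator on a one-hot lighting label and penalizing it with a label-classification loss, here is a minimal, hypothetical sketch. The tiny convolutional nets, the number of lighting classes, and the label-broadcasting scheme are illustrative assumptions, not the AJGAN architecture.

```python
# Hypothetical sketch: label-conditioned relighting with a lighting-classification loss.
import torch
import torch.nn as nn

n_lights = 8

class Relighter(nn.Module):
    def __init__(self):
        super().__init__()
        # input: 3 image channels + n_lights broadcast one-hot label channels
        self.net = nn.Sequential(nn.Conv2d(3 + n_lights, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())

    def forward(self, img, label):
        # img: (b, 3, h, w); label: (b,) integer lighting ids
        onehot = nn.functional.one_hot(label, n_lights).float()
        maps = onehot[:, :, None, None].expand(-1, -1, img.size(2), img.size(3))
        return self.net(torch.cat([img, maps], dim=1))

classifier = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, n_lights))
img = torch.rand(4, 3, 64, 64)
label = torch.randint(0, n_lights, (4,))
relit = Relighter()(img, label)
cls_loss = nn.functional.cross_entropy(classifier(relit), label)  # label classification loss
```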
Concentrated Local Part Discovery With Fine-Grained Part Representation for Person Re-Identification
Chaoqun Wan, Yue Wu, Xinmei Tian, Jianqiang Huang, Xian-Sheng Hua
Keywords: Feature extraction; Visualization; Cameras; Convolutional neural networks; Head; Torso; Legged locomotion; Person re-identification; local part learning; constraint attention mechanism; fine-grained representation
Abstract: The attention mechanism for person re-identification has been widely studied with deep convolutional neural networks. This mechanism works as a good complement to the global features extracted from an image of the entire human body. However, existing works mainly focus on discovering local parts with simple feature representations, such as global average pooling. Moreover, these works either require extra supervision, such as labeling of body joints, or pay little attention to the guidance of part learning, resulting in scattered activation of learned parts. Furthermore, existing works usually extract local features from different body parts via global average pooling and then concatenate them together as good global features. We find that local features acquired in this way contribute little to the overall performance. In this paper, we argue for the significance of local part description and explore the attention mechanism from both the local part discovery and local part representation aspects. For local part discovery, we propose a new constrained attention module to make the activated regions concentrated and meaningful without extra supervision. For local part representation, we propose a statistical-positional-relational descriptor to represent local parts from a fine-grained viewpoint. Extensive experiments are conducted to validate the overall performance, the effectiveness of each component, and the generalization ability. We achieve a rank-1 accuracy of 95.1% on Market1501, 64.7% on CUHK03, 87.1% on DukeMTMC-ReID, and 79.9% on MSMT17, outperforming state-of-the-art methods.
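The two ideas named here, concentrating an attention map and describing the attended part with richer statistics than average pooling, can be illustrated with a minimal, hypothetical sketch: a spatial-spread penalty around the attention centroid plus an attention-weighted mean/standard-deviation descriptor. This is a rough stand-in, not the paper's modules, and the toy feature map and attention map are assumptions.

```python
# Hypothetical sketch: concentration penalty on an attention map + statistical pooling.
import torch

feat = torch.randn(2, 256, 24, 8)                                    # (b, c, h, w) body feature map
attn = torch.softmax(feat.mean(1).flatten(1), dim=1).view(2, 24, 8)  # toy attention map

# concentration penalty: weighted spatial variance around the attention centroid
ys = torch.arange(24.).view(1, 24, 1)
xs = torch.arange(8.).view(1, 1, 8)
cy = (attn * ys).sum(dim=(1, 2), keepdim=True)
cx = (attn * xs).sum(dim=(1, 2), keepdim=True)
spread = (attn * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum(dim=(1, 2)).mean()

# statistical descriptor: attention-weighted mean and standard deviation per channel
w = attn.unsqueeze(1)                                                # (b, 1, h, w)
mean = (w * feat).sum(dim=(2, 3))
var = (w * (feat - mean[:, :, None, None]) ** 2).sum(dim=(2, 3))
descriptor = torch.cat([mean, var.clamp_min(1e-6).sqrt()], dim=1)    # (b, 2c)
print(spread.item(), descriptor.shape)
```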
Uni-and-Bi-Directional Video Prediction via Learning Object-Centric Transformation
Xiongtao Chen, Wenmin Wang
Keywords: Kernel; Predictive models; Task analysis; Bidirectional control; Optical imaging; Image reconstruction; Visualization; Object-centric motion transformation; uni-and-bi-directional prediction; video prediction; visual attention
Abstract: Video prediction, including uni-directional prediction of future frames and bi-directional prediction of in-between frames, is a challenging task and a problem worth exploring in the multimedia and computer vision fields. Existing practices usually make predictions by learning global motion information from the whole given image. However, humans often focus on key objects carrying vital motion information instead of the entire frame. Besides, different objects often show different movement and deformation, even in the same scene. In this connection, we build a novel model of object-centric video prediction, in which the motion signals of key objects are explicitly learned. This model predicts new frames by repeatedly transforming key objects and placing them back onto the original input images. To focus on these objects automatically, we create an attention module with substitutable strategies. Our method requires no annotated data, and we also use adversarial training to improve the sharpness of the predictions. We evaluate our model on the Moving MNIST, UCF101, and Penn Action datasets and achieve competitive results both quantitatively and qualitatively compared to existing methods. The experiments demonstrate that our uni-and-bi-directional network can predict motions well for different objects and generate plausible future and in-between frames.
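A minimal, hypothetical sketch of the object-centric idea, transforming an attended object region and compositing it back onto the input frame, is shown below. The fixed translation, hard mask, and crude background handling are toy assumptions, not the paper's learned attention or transformation modules.

```python
# Hypothetical sketch: move an attended object region and composite it onto the frame.
import torch
import torch.nn.functional as F

frame = torch.rand(1, 3, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[:, :, 20:40, 20:40] = 1.0                         # attended object region

# a small translation expressed as an affine sampling grid
theta = torch.tensor([[[1.0, 0.0, -0.1],
                       [0.0, 1.0, -0.1]]])             # normalized-coordinate shift
grid = F.affine_grid(theta, frame.size(), align_corners=False)
moved_obj = F.grid_sample(frame * mask, grid, align_corners=False)
moved_mask = F.grid_sample(mask, grid, align_corners=False)

# composite: keep the background, then paste the transformed object on top
background = frame * (1 - mask)                        # crude, inpainting-free background
next_frame = background * (1 - moved_mask) + moved_obj
print(next_frame.shape)                                # torch.Size([1, 3, 64, 64])
```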
Coarse-to-Fine Localization of Temporal Action Proposals
Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Tao Mei, Jiebo Luo
Keywords: Proposals; Videos; Painting; Brushes; Microsoft Windows; Task analysis; Feature extraction; Action Proposals; Action Recognition; Action Detection; Video Captioning
Abstract: Localizing temporal action proposals in long videos is a fundamental challenge in video analysis (e.g., action detection and recognition or dense video captioning). Most existing approaches overlook the hierarchical granularities of actions and thus fail to discriminate fine-grained action proposals (e.g., hand-washing laundry or changing a tire in vehicle repair). In this paper, we propose a novel coarse-to-fine temporal proposal (CFTP) approach to localize temporal action proposals by exploring different action granularities. Our proposed CFTP consists of three stages: a coarse proposal network (CPN) to generate long action proposals, a temporal convolutional anchor network (CAN) to localize finer proposals, and a proposal reranking network (PRN) to further identify proposals from the previous stages. Specifically, CPN explores three complementary actionness curves (namely pointwise, pairwise, and recurrent curves) that represent actions at different levels for generating coarse proposals, while CAN refines these proposals with a multiscale cascaded 1D-convolutional anchor network. In contrast to existing works, our coarse-to-fine approach can progressively localize fine-grained action proposals. We conduct extensive experiments on two action benchmarks (THUMOS14 and ActivityNet v1.3) and demonstrate the superior performance of our approach compared to state-of-the-art techniques on various video understanding tasks.
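To make the coarse stage tangible, here is a minimal, hypothetical sketch that turns a per-frame actionness curve into coarse temporal proposals by grouping contiguous high-scoring frames. It is a toy stand-in for the idea, not the CFTP networks, and the synthetic curve and threshold are assumptions.

```python
# Hypothetical sketch: group contiguous frames with high actionness into coarse proposals.
import numpy as np

def coarse_proposals(actionness, thresh=0.5):
    """Return (start, end) frame indices of contiguous runs above `thresh`."""
    above = actionness > thresh
    proposals, start = [], None
    for t, flag in enumerate(above):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            proposals.append((start, t - 1))
            start = None
    if start is not None:
        proposals.append((start, len(actionness) - 1))
    return proposals

curve = np.clip(np.sin(np.linspace(0, 6 * np.pi, 200)) + 0.2 * np.random.randn(200), 0, 1)
print(coarse_proposals(curve))   # list of coarse (start, end) segments, depending on the noise
```

A finer stage would then place multiscale anchors inside and around these segments and rescore them.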
Sensor-Augmented Neural Adaptive Bitrate Video Streaming on UAVs
Xuedou Xiao, Wei Wang, Taobin Chen, Yang Cao, Tao Jiang, Qian Zhang
Keywords: Throughput; Streaming media; Acceleration; Heuristic algorithms; Bit rate; Unmanned aerial vehicles; Adaptation models; Unmanned aerial vehicle; adaptive bitrate algorithm; video streaming; sensor-augmented system; deep reinforcement learning
Abstract: Recent advances in unmanned aerial vehicle (UAV) technology have revolutionized a broad class of civil and military applications. However, the design of wireless technologies that enable real-time streaming of high-definition video between UAVs and ground clients presents a conundrum. Most existing adaptive bitrate (ABR) algorithms are not optimized for air-to-ground links, which usually fluctuate dramatically due to the dynamic flight states of the UAV. In this paper, we present SA-ABR, a new sensor-augmented system that generates ABR video streaming algorithms with the assistance of the various kinds of inherent sensor data that are used to pilot UAVs. By incorporating the inherent sensor data with network observations, SA-ABR trains a deep reinforcement learning (DRL) model to extract salient features from the flight state information and automatically learn an ABR algorithm that adapts to the varying UAV channel capacity through the training process. SA-ABR does not rely on any assumptions or models about the UAV's flight states or the environment; instead, it makes decisions by exploiting the temporal properties of past throughput through a long short-term memory (LSTM) network to adapt itself to a wide range of highly dynamic environments. We have implemented SA-ABR on a commercial UAV and evaluated it in the wild. We compare SA-ABR with a variety of existing state-of-the-art ABR algorithms, and the results show that our system outperforms the best known existing ABR algorithm by 21.4% in terms of the average quality of experience (QoE) reward.
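As a rough illustration, the sketch below shows a hypothetical policy network that encodes past throughput with an LSTM and fuses it with flight-sensor features before scoring candidate bitrates. Layer sizes, the sensor dimensionality, and the bitrate ladder are assumptions, and no reinforcement-learning training loop is shown; this is not the SA-ABR implementation.

```python
# Hypothetical sketch: LSTM over past throughput fused with sensor features for ABR.
import torch
import torch.nn as nn

class BitratePolicy(nn.Module):
    def __init__(self, sensor_dim=6, n_bitrates=5, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden + sensor_dim, 64), nn.ReLU(),
                                  nn.Linear(64, n_bitrates))

    def forward(self, past_throughput, sensors):
        # past_throughput: (b, T) Mbps history; sensors: (b, sensor_dim), e.g. speed, acceleration
        _, (h, _) = self.lstm(past_throughput.unsqueeze(-1))
        logits = self.head(torch.cat([h[-1], sensors], dim=-1))
        return torch.softmax(logits, dim=-1)     # probability over candidate bitrates

policy = BitratePolicy()
probs = policy(torch.rand(4, 8) * 20.0, torch.randn(4, 6))
print(probs.argmax(dim=-1))                      # chosen bitrate index per sample
```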
Low-Rank Regularized Multi-Representation Learning for Fashion Compatibility Prediction
Peiguang Jing, Shu Ye, Liqiang Nie, Jing Liu, Yuting Su
Keywords: Sparse matrices; Matrix decomposition; Clothing; Feature extraction; Task analysis; Manifolds; Visualization; Image understanding; fashion compatibility; low-rank constraint; sparse representation; subspace learning
Abstract: The currently flourishing fashion-oriented community websites and the continuous pursuit of fashion have attracted increased research interest from the fashion analysis community. Many studies show that predicting the compatibility of fashion outfits is a nontrivial task due to the difficulty of capturing the implicit patterns affecting fashion compatibility prediction and the complex relationships presented by raw data. To address these problems, in this paper, we propose a transductive low-rank hypergraph-regularized multiple-representation learning framework (LHMRL), whereby we formulate the processes of feature representation and fashion compatibility prediction in a joint framework. Specifically, we first introduce a low-rank regularized multiple-representation learning framework, in which the lowest-rank multiple representations of samples can be learned to characterize samples from different perspectives. In this framework, we maximize the total difference among multiple representations based on Grassmann manifold theory and incorporate a common hypergraph regularizer to naturally encode the complex relationships between fashion items and an outfit. To enhance the representation ability of our model, we then develop a supervised learning term by exploiting two types of supervision information from labeled data. Experiments on a publicly available large-scale dataset demonstrate the effectiveness of our proposed model over the state-of-the-art methods.
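The two regularizers named here can be illustrated with a minimal, hypothetical NumPy sketch: a nuclear-norm (low-rank) penalty on a representation matrix and a hypergraph Laplacian smoothness term over items grouped into outfits. The toy incidence matrix, the unnormalized Laplacian, and the weighting are assumptions, not the paper's optimization problem.

```python
# Hypothetical sketch: nuclear-norm penalty + hypergraph Laplacian smoothness term.
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 20))            # representation of 50 items, 20 dims

nuclear_norm = np.linalg.svd(Z, compute_uv=False).sum()   # convex surrogate for rank(Z)

# toy incidence matrix H: 10 hyperedges (e.g. outfits), each grouping a few items
H = (rng.random((50, 10)) < 0.1).astype(float)
Dv = np.diag(H.sum(axis=1))                  # vertex degrees
De = np.diag(np.maximum(H.sum(axis=0), 1.0)) # hyperedge degrees (guarded against empty edges)
L = Dv - H @ np.linalg.inv(De) @ H.T         # simple unnormalized hypergraph Laplacian
smoothness = np.trace(Z.T @ L @ Z)           # small when co-grouped items have similar rows

objective_reg = nuclear_norm + 0.1 * smoothness
print(round(nuclear_norm, 2), round(smoothness, 2))
```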
Unsupervised Variational Video Hashing With 1D-CNN-LSTM Networks
Shuyan Li, Zhixiang Chen, Xiu Li, Jiwen Lu, Jie Zhou
Keywords: Binary codes; Hash functions; Probabilistic logic; Correlation; Decoding; Convolutional codes; Visualization; hashing; scalable video retrieval; unsupervised; variational
Abstract: Most existing unsupervised video hashing methods generate binary codes by using RNNs in a deterministic manner, which fails to capture the dominant latent variation of videos. In addition, RNN-based video hashing methods suffer from forgetting the content of early input frames due to the inherently sequential processing of RNNs, which is detrimental to capturing global information. In this work, we propose an unsupervised variational video hashing (UVVH) method for scalable video retrieval. Our UVVH method aims to capture the salient and global information in a video. Specifically, we introduce a variational autoencoder to learn a probabilistic latent representation of the salient factors of video variations. To better exploit the global information of videos, we design a 1D-CNN-LSTM model. The 1D-CNN-LSTM model processes long frame sequences in a parallel and hierarchical way and exploits the correlations between frames to reconstruct the frame-level features. As a consequence, the learned hash functions can produce reliable binary codes for video retrieval. We conduct extensive experiments on three widely used benchmark datasets, FCVID, ActivityNet and YFCC, to validate the effectiveness of our proposed approach.
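A minimal, hypothetical sketch of the ingredients named here, a 1D convolution over frame features, an LSTM, and a reparameterized Gaussian latent binarized by its sign, is shown below. All sizes, the single-layer design, and the sign-based binarization are illustrative assumptions, not the UVVH model.

```python
# Hypothetical sketch: 1D-CNN + LSTM encoder with a variational latent turned into bits.
import torch
import torch.nn as nn

class ToyVideoHasher(nn.Module):
    def __init__(self, feat_dim=128, code_bits=64):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, 64, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(64, 64, batch_first=True)
        self.mu = nn.Linear(64, code_bits)
        self.logvar = nn.Linear(64, code_bits)

    def forward(self, frames):
        # frames: (b, T, feat_dim) per-frame features
        x = torch.relu(self.conv(frames.transpose(1, 2))).transpose(1, 2)  # (b, T, 64)
        _, (h, _) = self.lstm(x)
        mu, logvar = self.mu(h[-1]), self.logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return torch.sign(z), mu, logvar                          # binary code in {-1, +1}

codes, mu, logvar = ToyVideoHasher()(torch.randn(4, 30, 128))
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / 4  # VAE prior (KL) term
print(codes.shape, kl.item())
```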
The Importance of Context When Recommending TV Content: Dataset and Algorithms
Miklas Strøm Kristoffersen, Sven Ewan Shepstone, Zheng-Hua Tan
Keywords: TV; Meters; Recommender systems; Automobiles; Context modeling; Complexity theory; Multimedia systems; Context awareness; recommender systems; data collection; TV
Abstract: Home entertainment systems feature in a variety of usage scenarios with one or more simultaneous users, for whom the complexity of choosing media to consume has increased rapidly over the last decade. Users' decision processes are complex and highly influenced by contextual settings, but data supporting the development and evaluation of context-aware recommender systems are scarce. In this paper we present a dataset of self-reported TV consumption enriched with contextual information about viewing situations. We show how the choice of genre associates with, among other factors, the number of users present and the users' attention levels. Furthermore, we evaluate the performance of predicting chosen genres given different configurations of contextual information, and compare the results to contextless predictions. The results suggest that including contextual features in the prediction causes notable improvements, and both temporal and social context show significant contributions.
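To illustrate the kind of comparison described, here is a minimal, hypothetical sketch that predicts a genre on synthetic data with and without contextual features. The toy data generator, feature names, and classifier choice are assumptions, not the paper's dataset or algorithms.

```python
# Hypothetical sketch: genre prediction with vs. without contextual features (toy data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
user = rng.integers(0, 5, n)            # user id (non-contextual feature)
n_viewers = rng.integers(1, 4, n)       # social context: number of people watching
attention = rng.random(n)               # contextual attention level in [0, 1]
# toy rule: group viewing -> genre 0, focused solo -> genre 1, otherwise user-dependent
genre = np.where(n_viewers >= 2, 0, np.where(attention > 0.5, 1, 2 + user % 2))

X_ctx = np.column_stack([user, n_viewers, attention])   # with contextual features
X_noctx = user.reshape(-1, 1)                           # contextless baseline
for name, X in [("with context", X_ctx), ("without context", X_noctx)]:
    Xtr, Xte, ytr, yte = train_test_split(X, genre, test_size=0.3, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)
    print(f"{name}: accuracy = {acc:.2f}")
```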