Vision and Language: Bridging Vision and Language with Deep Learning
Speakers: Tao Mei (Microsoft Research, China), Jiebo Luo (University of Rochester, USA)


Recognition of visual content has been a fundamental challenge in computer vision for decades, where previous research predominantly focused on understanding visual content using a predefined yet limited vocabulary. Thanks to the recent development of deep learning techniques, researchers in both the computer vision and multimedia communities are now striving to bridge vision with natural language, which can be regarded as the ultimate goal of visual understanding. We will present recent advances in exploring the synergy of visual understanding and language processing techniques, including vision-language alignment, visual captioning and commenting, visual emotion analysis, visual question answering, and visual storytelling, as well as open issues for this emerging research area.

Speaker Bios

Tao Mei is a Senior Researcher with Microsoft Research, Beijing, China. His current research interests include multimedia content analysis, computer vision, and machine learning. Tao has shipped a dozen inventions and technologies to Microsoft products and services. He has authored or co-authored over 100 papers in journals and conferences, and holds over 18 granted U.S. patents. Tao was the recipient of 10 paper awards from prestigious multimedia journals and conferences, including the IEEE Trans. on Circuits and Systems for Video Technology Best Paper Award in 2014, the IEEE Trans. on Multimedia Prize Paper Award in 2013, and the ACM Multimedia Best Paper Awards in 2009 and 2007. He is an Editorial Board Member of IEEE Trans. on Multimedia (TMM) and ACM Trans. on Multimedia Computing, Communications, and Applications (TOMM). He is the Program Co-chair of ACM Multimedia 2018, CBMI 2017, IEEE ICME 2015, and IEEE MMSP 2015, and has been an Area Chair for a dozen international conferences. He is a Senior Member of the IEEE, a Distinguished Scientist of the ACM, and a Fellow of the IAPR.

Jiebo Luo joined the University of Rochester in Fall 2011 after over fifteen years at Kodak Research Laboratories, where he was a Senior Principal Scientist leading research and advanced development. He has been involved in numerous technical conferences, including serving as the program co-chair of ACM Multimedia 2010, IEEE CVPR 2012, and IEEE ICIP 2017. He has served on the editorial boards of the IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, Pattern Recognition, Machine Vision and Applications, and Journal of Electronic Imaging. He has authored over 300 technical papers and 90 US patents. Prof. Luo is a Fellow of the SPIE, IEEE, and IAPR.

Multi-Camera Processing, Analysis and Applications
Speakers: Andrea Cavallaro (Queen Mary University of London, UK), Juan C. SanMiguel (University Autonoma of Madrid, Spain)


Image and video processing technologies are experiencing a rapid revolution driven by the ever-growing computational power of edge devices. Applications benefitting from this revolution include self-driving vehicles, multi-robot systems, wide-area surveillance, disaster management, and various Internet of Things applications for smart cities and smart homes. This tutorial will offer an overview of the key features and algorithms for modern multi-camera video analytics and will cover the challenges associated with their practical implementation. Participants will learn the key elements of such multi-camera systems and how to design accurate and robust algorithms that understand a variety of dynamic scenes and adapt to different operational conditions. The tutorial will consist of theoretical explanations followed by examples using software that will be distributed to the participants.

Speaker Bios  

Andrea Cavallaro is Professor of Multimedia Signal Processing and Director of the Centre for Intelligent Sensing at Queen Mary University of London, UK. He received his Ph.D. in Electrical Engineering from the Swiss Federal Institute of Technology (EPFL), Lausanne, in 2002. He was a Research Fellow with British Telecommunications (BT) in 2004/2005 and was awarded the Royal Academy of Engineering Teaching Prize in 2007; three student paper awards on target tracking and perceptually sensitive coding at IEEE ICASSP in 2005, 2007 and 2009; and the best paper award at IEEE AVSS 2009. Prof. Cavallaro is Senior Area Editor for the IEEE Transactions on Image Processing and Associate Editor for the IEEE Transactions on Circuits and Systems for Video Technology and IEEE Multimedia. He is an elected member of the IEEE Signal Processing Society Image, Video, and Multidimensional Signal Processing Technical Committee and chair of its Awards committee. He served as an elected member of the IEEE Signal Processing Society Multimedia Signal Processing Technical Committee, as Area Editor for the IEEE Signal Processing Magazine, as Associate Editor for the IEEE Transactions on Multimedia and the IEEE Transactions on Signal Processing, and as Guest Editor for seven international journals. He was General Chair for IEEE/ACM ICDSC 2009, BMVC 2009, M2SFA2 2008, SSPE 2007, and IEEE AVSS 2007, and Technical Program Chair of IEEE AVSS 2011, the European Signal Processing Conference (EUSIPCO 2008), and WIAMIS 2010. He has published more than 170 journal and conference papers, one monograph (Video Tracking, 2011, Wiley), and three edited books: Multi-Camera Networks (2009, Elsevier); Analysis, Retrieval and Delivery of Multimedia Content (2012, Springer); and Intelligent Multimedia Surveillance (2013, Springer).

Juan C. SanMiguel is an associate professor (interim) in the Department of Electronic Technology and Communications at Universidad Autónoma de Madrid, Spain. He received the M.S. degree in Electrical Engineering ("Ingeniero de Telecomunicación") in 2006 and the Ph.D. in Computer Science and Telecommunication in 2011, both from Universidad Autónoma de Madrid. Since 2005, he has been with the Video Processing and Understanding Lab (VPU-Lab) at Universidad Autónoma de Madrid as a researcher and teaching assistant. From June 2013 to June 2014, he was a postdoctoral researcher at Queen Mary University of London (UK) under a Marie Curie IAPP fellowship. In 2015 he visited the Institute of Computing Technology (ICT) of the Chinese Academy of Sciences (CAS) in Beijing, China. He serves as a reviewer for several international journals (IEEE TIP, IEEE TCSVT, Elsevier IMAVIS) and conferences (IEEE ICIP, IEEE WACV, IEEE AVSS). He has published more than 35 journal and conference papers. His current research interests focus on multi-camera activity understanding and performance evaluation, oriented to target detection and tracking. He has lectured at the summer school for video surveillance (2008 and 2010 editions) and in training courses for the Spanish law enforcement agency (Guardia Civil) in 2013 and 2015, and has given an invited conference talk at the X Conference on Science and Technology ESPE 2015 (Quito, Ecuador) and an invited seminar talk at the Chinese Academy of Sciences (CAS) in Beijing, China, in 2015. He received the award for the best Ph.D. thesis of the School of Electrical Engineering, Universidad Autónoma de Madrid, in 2014, and was runner-up for the Best Ph.D. Thesis in Multimedia award (given by the Spanish electrical engineering association) in 2013.

Modern First-Order Optimization Methods for Imaging Problems
Speaker: Wotao Yin (UCLA, USA)


Optimization plays an increasingly important role in computational imaging, where many problems reduce to large-scale structured optimization. The huge number of variables in imaging problems often precludes the use of off-the-shelf, sophisticated algorithms such as interior-point methods, because they exceed memory limits. Scalable optimization algorithms with small memory footprints, low per-iteration costs, and excellent parallelization properties have become the popular choices. Algorithms for structured optimization have recently seen significant improvements owing to the revival of numerical techniques such as operator splitting, stochastic sampling, and coordinate update. Favorable structures in imaging problems can reduce a problem with a huge number of variables and data to simple, small, parallel subproblems. Developing and adapting such algorithms can potentially revolutionize the solution of many imaging problems. However, exploiting structures in large-scale optimization is not an easy task: one needs to recognize those structures to generate simple subproblems, and then combine them into fast and scalable algorithms. This is harder than applying ADMM or block coordinate descent right out of the box.

This tutorial focuses on the latest first-order algorithms and techniques for exploiting problem structures. It will provide a high-level overview of operator splitting and coordinate update methods (which include proximal, ADMM, primal-dual, and coordinate descent methods as special cases) in the context of computational imaging, along with concrete examples in image reconstruction, optical flow, segmentation, and others. Emphasis will be given to exploiting problem structures and the fundamental mechanism of building first-order algorithms with fast convergence. Some key results will be "proved" in simplified settings and through graphical illustrations. Stochastic approximation algorithms and recent nonconvex optimization results will also be covered.
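As a flavor of the proximal methods mentioned above, here is a minimal sketch (not the presenters' code; function names and parameters are illustrative) of proximal gradient descent, i.e. ISTA, applied to an l1-regularized denoising model min_x 0.5||x - b||^2 + lam||x||_1. The smooth term contributes an ordinary gradient step, while the nonsmooth l1 term is handled by its proximal operator, elementwise soft-thresholding:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t * ||x||_1: elementwise soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_denoise(b, lam=0.5, step=1.0, iters=100):
    # Proximal gradient (ISTA) for: min_x 0.5*||x - b||^2 + lam*||x||_1.
    # Gradient of the smooth term is (x - b); the prox handles the l1 term.
    x = np.zeros_like(b)
    for _ in range(iters):
        x = soft_threshold(x - step * (x - b), step * lam)
    return x

noisy = np.array([2.0, 0.1, -1.5, 0.05])
print(ista_denoise(noisy))  # small entries shrink to zero, large ones by lam
```

Note that the subproblem in each iteration is separable across pixels; this separability is exactly the kind of favorable structure the tutorial emphasizes.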

Speaker Bio

Wotao Yin is a professor in the Department of Mathematics at UCLA. His research interests lie in computational optimization and its applications in image processing, machine learning, and other data science problems. He received his B.S. in Mathematics from Nanjing University in 2001, and his M.S. and Ph.D. in Operations Research from Columbia University in 2003 and 2006, respectively. From 2006 to 2013, he was with Rice University. He won the NSF CAREER award in 2008, an Alfred P. Sloan Research Fellowship in 2009, and the Morningside Gold Medal in 2016. He invented fast algorithms for sparse optimization and has been leading the research of optimization algorithms for large-scale problems. His methods and algorithms have found very broad applications across different fields of science and engineering. Google Scholar lists 90 of his papers, of which 27 have been cited at least 100 times and 3 have over 1000 citations. He has also co-authored 20 open-source software packages and 2 review articles.

Future Video Coding – Coding Tools and Developments beyond HEVC
Speakers: Jens‐Rainer Ohm and Mathias Wien (RWTH Aachen University, Germany)


While HEVC is the state-of-the-art video compression standard, with profiles addressing virtually all video-related products of today, recent developments suggest significant performance improvements relative to this established technology. At the same time, the target application space is evolving towards higher picture resolution, higher dynamic range, fast motion capture, and previously unaddressed formats such as 360° video. The signal properties of this content open the door for different designs of established coding tools as well as the introduction of new algorithmic concepts that have not been applied in the context of video coding before. Specifically, the required ultra-high picture resolutions and the projection operations involved in processing 360° video provide exciting options for new developments. This type of content also changes how video is consumed (enabling the use of head-mounted displays) as well as how video content is created and produced.

This tutorial will provide a comprehensive overview of recent developments and perspectives in the area of video coding. As a central element, the work performed in the Joint Video Exploration Team (JVET) of ITU-T SG16/Q6 (VCEG) and ISO/IEC JTC1 SC29/WG11 (MPEG) is covered, as well as trends outside the tracks of the standardization bodies. The focus of the presentation is on algorithms, tools, and concepts with potential for competitive future video compression technology. In this context, the potential of methods related to perceptual models, synthesis of perceptually equivalent content, and deep-learning-based approaches will also be discussed.

Speaker Bios

Jens-Rainer Ohm has held the chair position of the Institute of Communication Engineering at RWTH Aachen University, Germany, since 2000. His research and teaching activities cover the areas of motion-compensated, stereoscopic and 3-D image processing; multimedia signal coding, transmission and content description; audio signal analysis; as well as fundamental topics of signal processing and digital communication systems. Since 1998, he has participated in the work of the Moving Picture Experts Group (MPEG). He has chaired or co-chaired various standardization activities in video coding, namely the MPEG Video Subgroup since 2002, the Joint Video Team (JVT) of MPEG and ITU-T SG 16 VCEG from 2005 to 2009, and, currently, the Joint Collaborative Team on Video Coding (JCT-VC) as well as the Joint Video Exploration Team (JVET). Prof. Ohm has authored textbooks on multimedia signal processing, analysis and coding, and on communication engineering and signal transmission, as well as numerous papers in the fields mentioned above.

Mathias Wien received the Diploma and Dr.-Ing. degrees from RWTH Aachen University, Germany, in 1997 and 2004, respectively. He currently works as a senior research scientist, head of administration, and lecturer, holding a permanent position at the Institute of Communication Engineering of RWTH Aachen University. His research interests include image and video processing, space-frequency adaptive and scalable video compression, and robust video transmission. Mathias has participated in and contributed to ITU-T VCEG, ISO/IEC MPEG, the Joint Video Team, and the Joint Collaborative Team on Video Coding (JCT-VC) of VCEG and ISO/IEC MPEG in the standardization work towards AVC and HEVC. He has co-chaired and coordinated several ad hoc groups as well as tool and core experiments. He has published the Springer textbook “High Efficiency Video Coding: Coding Tools and Specification”, which fully covers Version 1 of HEVC; an extended edition covering the subsequent versions of HEVC is in preparation. Mathias is a member of the IEEE Signal Processing Society and the IEEE Circuits and Systems Society. At RWTH Aachen University, he teaches the master-level lecture “Video Coding: Algorithms and Specification”, among other topics; the lecture covers the state of the art in video coding, including HEVC.

Distance Metric Learning for Image and Video Understanding
Speakers: Jiwen Lu (Tsinghua University, China), Ruiping Wang (Chinese Academy of Sciences, China)


Over the past decade, distance metric learning has developed into one of the basic techniques in machine learning and has been successfully applied to a wide range of image and video understanding tasks with state-of-the-art performance. In this tutorial, we will survey the evolution of distance metric learning techniques and discuss how they are employed to boost the performance of various image and video understanding tasks. First, we briefly introduce the basic concept of distance metric learning and show the key advantages and disadvantages of existing distance metric learning methods in different image and video understanding tasks. Second, we introduce some of our newly proposed distance metric learning methods from two aspects, sample-based metric learning and set-based metric learning, which are developed for different application-specific image and video understanding tasks. Lastly, we discuss some open problems in distance metric learning and show how to further develop more advanced metric learning algorithms for image and video understanding in the future.
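To make the sample-based setting concrete, here is a toy sketch (not the speakers' methods; all names and the update rule are illustrative) of learning a Mahalanobis metric d(x, y) = ||L(x - y)||, where gradient steps shrink distances between similar pairs and grow them between dissimilar ones:

```python
import numpy as np

def mahalanobis(x, y, L):
    # Distance under the metric M = L^T L: d(x, y) = ||L (x - y)||_2.
    d = L @ (x - y)
    return float(np.sqrt(d @ d))

def metric_step(L, pairs, labels, lr=0.1):
    # One gradient step on sum_i s_i * d^2(x_i, y_i), where s_i = +1 pulls a
    # similar pair together and s_i = -1 pushes a dissimilar pair apart.
    grad = np.zeros_like(L)
    for (x, y), s in zip(pairs, labels):
        diff = (x - y)[:, None]
        # d^2 = (x-y)^T L^T L (x-y); its gradient w.r.t. L is 2 L diff diff^T.
        grad += s * 2.0 * L @ (diff @ diff.T)
    return L - lr * grad

x = np.array([1.0, 0.0])
y = np.array([0.0, 0.0])
L = np.eye(2)                       # identity metric = Euclidean distance
print(mahalanobis(x, y, L))         # 1.0
L = metric_step(L, [(x, y)], [1])   # treat (x, y) as a similar pair
print(mahalanobis(x, y, L))         # < 1.0 after pulling the pair together
```

Practical methods add regularization and constraints (e.g. keeping M positive semidefinite), but the pairwise pull/push objective above is the common core of sample-based approaches.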

Speaker Bios

Jiwen Lu is currently an Associate Professor with the Department of Automation, Tsinghua University, China. His current research interests include computer vision, pattern recognition, and machine learning. He has authored or co-authored over 140 scientific papers in these areas, 38 of which are IEEE Transactions papers (including 4 T-PAMI and 10 T-IP papers) and 20 of which appear in top-tier computer vision conferences (ICCV/CVPR/ECCV). He is an elected member of the Information Forensics and Security Technical Committee of the IEEE Signal Processing Society, an Associate Editor for Pattern Recognition Letters, Neurocomputing, and IEEE Access, a Guest Editor for five journals including Pattern Recognition, Computer Vision and Image Understanding, and Image and Vision Computing, and a reviewer for over 40 international journals such as IEEE T-PAMI, T-IP, and T-CSVT. He serves or has served as an Area Chair for ICIP 2017, VCIP 2016, ICB 2016, BTAS 2016, WACV 2016, ICME 2015, and ICB 2015, a Workshop Chair for WACV 2017 and ACCV 2016, a Special Session Chair for VCIP 2015, and a technical program committee member for over 20 international conferences such as CVPR, ICCV, ECCV, NIPS, and AAAI. He received the Best Student Paper Award from the Pattern Recognition and Machine Intelligence Association (PREMIA) of Singapore in 2012 and the Top 10% Best Paper Award at the 2014 IEEE International Workshop on Multimedia Signal Processing (MMSP), and was selected for the National 1000 Young Talents Plan Program in 2015. Two of his authored or co-authored conference papers were nominated as Best Paper Award candidates at ICME 2011 and ICME 2013. He has co-organized several workshops and competitions at international conferences such as ICME 2017, FG 2015, ICME 2014, ACCV 2014, and IJCB 2014, and has given or will give tutorials at international conferences including CVPR 2017, ECCV 2016, CVPR 2015, FG 2015, ACCV 2014, ICME 2014, and IJCB 2014. He is a Senior Member of the IEEE.

Ruiping Wang is an Associate Professor at the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS). Prior to joining ICT in July 2012, he was a postdoctoral researcher with Tsinghua University from July 2010 to June 2012. He also spent one year, from Nov. 2010 to Oct. 2011, working as a Research Associate with the University of Maryland, College Park. He has published more than 40 papers in peer-reviewed journals and conferences, including IEEE TPAMI, TIP, TMM, PR, CVPR, ICCV, and ICML, and received the Best Student Poster Award Runner-up from IEEE CVPR 2008 for his work on Manifold-Manifold Distance. He is a Guest Editor for Pattern Recognition and serves as a regular reviewer/PC member for a number of leading journals and conferences, e.g., IEEE TPAMI, TIP, TCSVT, TMM, TNNLS, IJCV, ICCV, CVPR, and ECCV. He has co-organized tutorials at ACCV 2014 and CVPR 2015 and a workshop at ACCV 2016, and has given invited talks at workshops of ICME 2014, ACCV 2014, and ICCV 2015. His current research interests include video-based face recognition/retrieval, facial expression analysis, image set classification, distance metric learning, and manifold learning. He is a member of the IEEE.

Hyperspectral Image Processing and Analysis
Speaker: Jocelyn Chanussot (Grenoble Institute of Technology, France)


Hyperspectral imagery, also called imaging spectroscopy, refers to images with a large number (typically a few hundred) of narrow and contiguous spectral bands, covering a wide range of the electromagnetic spectrum from the visible to the infrared domain. Hyperspectral data provide a very fine description of the chemical components of the sensed materials, enabling their detection, discrimination, and characterization.

The application of hyperspectral imagery is growing rapidly, especially in the context of space and airborne remote sensing, as well as planetary exploration and astrophysics. Additional applications include monitoring and management of the environment, physical analysis of materials, biomedical imaging, defense and security, food safety, detection of counterfeit objects (especially in pharmacology), and precision agriculture.

Unfortunately, every rose has its thorns, and the price to pay for the enhanced spectral diversity is high-dimensional data. The challenge lies in defining appropriate signal and image processing methods. In this tutorial, we will review processing and analysis techniques that explicitly handle the high dimensionality of the data, addressing various tasks including image denoising, image segmentation, hierarchical analysis, target detection, spectral unmixing, and image compression. Results will be presented on images from a variety of contexts.
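As a small illustration of one of the tasks listed above, spectral unmixing models each pixel as a mixture of known "endmember" spectra and inverts that model to recover material abundances. The sketch below uses entirely hypothetical endmember spectra and a plain least-squares inversion; it is not drawn from the tutorial material:

```python
import numpy as np

# Hypothetical endmember spectra: 4 spectral bands, 2 materials (columns of E).
E = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.2, 0.9],
              [0.1, 0.8]])

# Synthesize a noiseless mixed pixel: 70% of material 0 and 30% of material 1.
true_abundances = np.array([0.7, 0.3])
pixel = E @ true_abundances

# Linear unmixing by least squares: find a with pixel ~= E a. Practical
# unmixing additionally enforces nonnegativity and sum-to-one constraints.
abundances, *_ = np.linalg.lstsq(E, pixel, rcond=None)
print(abundances)  # recovers approximately [0.7, 0.3]
```

With hundreds of bands and only a handful of endmembers, the same inversion is heavily overdetermined, which is one way the high spectral dimensionality is turned from a burden into an asset.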

Speaker Bio 

Jocelyn Chanussot (M'04–SM'04–F'12) received the M.Sc. degree in electrical engineering from the Grenoble Institute of Technology (Grenoble INP), Grenoble, France, in 1995, and the Ph.D. degree from the Université de Savoie, Annecy, France, in 1998. In 1999, he was with the Geography Imagery Perception Laboratory of the Délégation Générale de l'Armement (DGA, the French National Defense Department). Since 1999, he has been with Grenoble INP, where he was an Assistant Professor from 1999 to 2005 and an Associate Professor from 2005 to 2007, and is currently a Professor of signal and image processing. He conducts his research at the Grenoble Images Speech Signals and Automatics Laboratory (GIPSA-Lab). His research interests include image analysis, multicomponent image processing, nonlinear filtering, and data fusion in remote sensing. He has been a visiting scholar at Stanford University (USA), KTH (Sweden), and NUS (Singapore). Since 2013, he has been an Adjunct Professor at the University of Iceland, and from 2015 to 2017 he is a visiting professor at the University of California, Los Angeles (UCLA).

Dr. Chanussot is the founding President of the IEEE Geoscience and Remote Sensing French chapter (2007-2010), which received the 2010 IEEE GRS-S Chapter Excellence Award. He was co-recipient of the NORSIG 2006 Best Student Paper Award, the IEEE GRSS 2011 and 2015 Symposium Best Paper Awards, the IEEE GRSS 2012 Transactions Prize Paper Award, and the IEEE GRSS 2013 Highest Impact Paper Award. He was a member of the IEEE Geoscience and Remote Sensing Society AdCom (2009-2010), in charge of membership development, and the General Chair of the first IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS). He was the Chair (2009-2011) and Co-chair (2005-2008) of the GRS Data Fusion Technical Committee, a member of the Machine Learning for Signal Processing Technical Committee of the IEEE Signal Processing Society (2006-2008), and the Program Chair of the IEEE International Workshop on Machine Learning for Signal Processing (2009). He was an Associate Editor for the IEEE Geoscience and Remote Sensing Letters (2005-2007) and for Pattern Recognition (2006-2008). Since 2007, he has been an Associate Editor for the IEEE Transactions on Geoscience and Remote Sensing. He was the Editor-in-Chief of the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2011-2015), a Guest Editor for the Proceedings of the IEEE in 2013, and a Guest Editor for the IEEE Signal Processing Magazine in 2014. He is a Fellow of the IEEE and a member of the Institut Universitaire de France (2012-2017).

Scalable Deep Learning for Image Processing with Microsoft Cognitive Toolkit
Speakers: Cha Zhang (Microsoft Research, USA), Taifeng Wang (Microsoft Research, China)


Deep learning has become the de facto standard method for most image processing problems. In the past few years, deep learning algorithms have met and exceeded human-level performance in image recognition. Nevertheless, training deep networks on a large data set remains very challenging: the sheer amount of computation needed to train a convolutional neural network can take months on large data sets. Combined with the black art of hyper-parameter tuning, this means the community desperately needs tools that help train deep learning networks on multiple servers with multiple GPUs. This tutorial will introduce Microsoft's Cognitive Toolkit, also known as CNTK (https://github.com/Microsoft/CNTK), to the image processing community. Various algorithms supported by the toolkit will be presented, and the benefits of CNTK in terms of speed and scalability relative to existing toolkits will be described.
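The core idea behind multi-GPU, multi-server training is synchronous data-parallel SGD: each worker computes a gradient on its shard of the data, the gradients are averaged (an all-reduce), and one shared model is updated. The sketch below is a toolkit-agnostic simulation of that scheme on a linear least-squares model, not CNTK API code; all names are illustrative:

```python
import numpy as np

def gradient(w, X, y):
    # Gradient of the mean squared error 0.5 * mean((Xw - y)^2) on one shard.
    return X.T @ (X @ w - y) / len(y)

def parallel_sgd_step(w, shards, lr=0.1):
    # Synchronous data-parallel step: one gradient per simulated worker,
    # averaged (the "all-reduce"), then a single shared-model update.
    grads = [gradient(w, X, y) for X, y in shards]
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
shards = [(X[i::4], y[i::4]) for i in range(4)]  # 4 simulated workers

w = np.zeros(3)
for _ in range(200):
    w = parallel_sgd_step(w, shards)
print(w)  # converges close to w_true
```

Real distributed trainers complicate this picture with communication cost, which is where techniques such as gradient quantization and asynchronous updates (supported by toolkits like CNTK) come in.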

Speaker Bios

Cha Zhang is a Principal Researcher in the Advanced Technology Group at Microsoft Research. He received the B.S. and M.S. degrees from Tsinghua University, Beijing, China in 1998 and 2000, respectively, both in Electronic Engineering, and the Ph.D. degree in Electrical and Computer Engineering from Carnegie Mellon University, in 2004. His current research focuses on applying various audio/image/video processing and machine learning techniques to multimedia applications. Dr. Zhang has published 3 books, more than 80 technical papers and 30+ U.S. patents. He won the best paper award at ICME 2007, the top 10% award at MMSP 2009, the best student paper award at ICME 2010, and the top 10% award at ICIP 2014. He was the Program Co-Chair for VCIP 2012, and the General Co-Chair for ICME 2016. He currently is an Associate Editor for IEEE Trans. on Circuits and Systems for Video Technology, and IEEE Trans. on Multimedia.

Taifeng Wang is a Lead Researcher in the Artificial Intelligence group, Microsoft Research Asia. He received his B.S. and M.S. degrees in Electronic Engineering from the University of Science and Technology of China in 2003 and 2006, respectively. His research interests include large-scale machine learning, distributed systems, internet advertising, and search engine techniques. His research focuses on modeling user behavior in ads systems to help the search engine deliver better ads. Before working on ads, he developed a large-scale graph learning platform that handles learning and mining tasks on graphs with billions of nodes. His latest research interest is enabling deep learning algorithms to scale well on distributed systems, and he actively contributes to the distributed learning algorithms in Microsoft's Cognitive Toolkit.