Problem statement of our work

Choice

4.1.1.1 Online/offline/deferred-logic approaches

Review of Kalman filtering

MCMCDA: data association by

MOT

4.2.2.1 Moves on trajectories

Toward tracking

4.3.1 Integration of appearance models

Qualitative analysis with visual ambiguities

In the previous chapters, the tasks related to learning audiovisual signatures were handled independently at each time instant t and in a single-target context.

Indeed, apart from background subtraction, all the tools (visual detection, activity detection, ...)

1.1 Traditional architecture of speaker recognition systems

Example of a DET curve: false acceptance rate on the x-axis and false rejection rate on the y-axis. Figure taken from [Mar+97].

Block diagram of a traditional re-identification system

SDALF descriptor: (a) raw images, (b) partition of the segmented silhouette, (c) HSV histograms, (d) MSCR, and (e) RHCP [Far+10]

Example of a CMC curve: re-identification rate vs. rank r

1.6 Samples from the VIPeR database

Samples from the database E...

Samples from the i-LIDS database [Pro+10]

Configuration of our experimental platform (a), image taken from camera 1 (b)

Block diagram of our system for learning an audiovisual person signature

Processing chain for generating an audio signature

Example output of a voice activity detector on a 14-second audio file containing two speech segments. The signal, sampled at 16 kHz, was analyzed using 16 ms frames.

Generation process of the MFCCs, denoted c(i), from a signal frame x(i)

Frequency response of a bank of 10 filters following the perceptual Mel scale over the frequency domain

Processing chain for generating a video signature

Example of person detection on our corpus: extraction of the bounding box containing the target

Generation of the symmetry and antisymmetry axes, and examples of partitions of several silhouettes [Far+10]

2.10 Features extracted from SDALF: in (a) a pair of image crops of the same individual, in (b) the extracted symmetry and antisymmetry axes, in (c) the HSV histograms, in (d) the MSCRs, and in (e) the RHCPs

CMC curves for each separate component of SDALF on the datasets ETHZ1 (a), ETHZ2 (b), and ETHZ3 (c)

Images corresponding to the video signatures of the 3 target persons

List of figures

In (a), extraction in the image plane of the position of the detected target's feet; in (b), projection into the image plane of the grid of the camera frame obtained by calibration

SRMR on a speech signal emitted at several distances, on synthetic data (a) and on real data (b)

3.4 Evolution of the SRMR as a function of the distance to the microphone, in 3D view (a) and top view (b)

Audio-Video Proximity Index computed over the whole acquisition space. The local maxima correspond to positions near the microphone.

In (a), the classification learned by the SVM for th = 0.4; in (b), the classification error as a function of th for the 3 speakers

Classification of each position for the 3 speakers. In yellow, the observed positions classified as salient; in black, the contour of the salience zone.

CCA between a vector of three parameters and the inverse of the distance to the sound source for several positions: in (a) the parameters, in (b) their CCA transform

CCA results on the test data for 4 combinations of descriptors: energy + logV in (a), energy + SRMR in (b), SRMR + logV in (c), and energy + logV + SRMR in (d)

3.11 Associations of audio and video observations for 3 noise levels in (a), (b), and (c) when th = 1, and rate of associated observations as a function of the threshold th

Block diagram of a multi-target visual tracker, from a set of successive images (top left) to the target trajectories inferred in the ground plane

Illustrations of trajectory moves. The colored lines and geometric shapes represent the trajectories and their observations, respectively. The black circles represent false alarms. Figure taken from

From top left to bottom right: test scenarios with a variable number K of trajectories

Types of association errors: (a) fragmentation, (b) association with a false alarm, (c) identity change (ID switch)

Data association results as a function of the number of trajectories, in terms of (a) the estimated number of trajectories, (b) the ICAR criterion, and (c) the NCA criterion

From top left to bottom right: test scenarios with 10 randomly generated trajectories, at several false alarm rates per unit time and per unit volume: ? b V =

Data association results in terms of (a) estimated number of trajectories, (b) ICAR criterion, and (c) NCA criterion

From top left to bottom right: test scenarios with a variable observation detection rate

Data association results in terms of (a) estimated number of trajectories, (b) ICAR, and (c) NCA

4.10 Estimation of a distance model from an observation to the correct target (blue curve) and of a distance model from an observation to a wrong target (red curve)

Comparison of the results of the MCMCDA approach with the dynamic model alone (blue curves) and with an added visual appearance model (red curves)

Principle of integrating audiovisual signatures into the tracking of a trajectory

Audiovisual association results, in terms of precision and recall, with/without tracking

M... criterion, MCMCDA + visual signature with/without audio signature (configurations with 1 or 2 microphones)

M... criterion, MCMCDA with audio signature (configurations with 1 and 2 microphones)

Scenario with two targets inducing ambiguous observations in (a), and two proposed partitions: a correct one in (b) and an incorrect one with an ID switch in (c)

List of tables

Illustration of the verification, identification, and structuring tasks

2.1 Evaluation of the VAD methods on the corpus

Speaker recognition performance at three noise levels

Performance of state-of-the-art person detectors

2.4 nAUC (normalized Area Under the Curve) score for the 3 SDALF features, as well as the complete descriptor, on the three datasets ETHZ1, ETHZ2, and ETHZ3

Tools for learning the audio and video signatures

Center modulation frequencies (f_c) and bandwidths (BP), in Hz, of the filter bank

Pearson correlation between different combinations of parameters and the inverse of the distance

Mean squared error between the reference and the 4 parameter configurations

Statistics on the distance estimation errors

Specifics of the online vs. offline vs. deferred-logic SOT/MOT strategies

Notations and illustrations of the elements handled in multi-target tracking

Operations on trajectories

Summary of the results on the 3 datasets

Corpus used to build the visual models

Summary of the results on the 3 datasets

Likelihood of the partitions with/without ID switch

Bibliography

A. Anjos, Bob: a free signal processing and machine learning toolbox for researchers, 20th ACM Conference on Multimedia Systems (ACMMM), 2012.

A. Bedagkar-Gala and S. K. Shah, A survey of approaches and trends in person re-identification, Image and Vision Computing, vol.32, pp.270-286, 2014.

A. Bhattacharyya, On a Measure of Divergence between Two Multinomial Populations, Sankhyā: The Indian Journal of Statistics, pp.401-406, 1946.

C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), 2006.

K. Bernardin and R. Stiefelhagen, Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics, EURASIP Journal on Image and Video Processing, vol.1, p.246309, 2008.

F. Decroix, Online Audiovisual Signature Training for Person Re-identification, Proceedings of the 10th International Conference on Distributed Smart Camera. ICDSC '16, pp.62-68, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01800283

N. Dehak, Support Vector Machines versus Fast Scoring in the Low-Dimensional Total Variability Space for Speaker Verification, 2009.

R. Drullman, J. M. Festen, and R. Plomp, Effect of reducing slow temporal modulations on speech reception, The Journal of the Acoustical Society of America, vol.95, pp.2670-2680, 1994.

M. Dikmen, Pedestrian Recognition with a Learned Metric, Proceedings of the 10th Asian Conference on Computer Vision-Volume Part IV. ACCV'10, pp.501-512, 2011.

A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society. Series B (Methodological), vol.39, pp.1-38, 1977.

S. B. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech, and Signal Processing, pp.357-366, 1980.

P. Dollár, Fast Feature Pyramids for Object Detection, IEEE Trans. Pattern Anal. Mach. Intell., vol.36, issue.8, pp.1532-1545, 2014.

N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol.1, pp.886-893, 2005.
URL : https://hal.archives-ouvertes.fr/inria-00548512

A. Ess, B. Leibe, and L. Van Gool, Depth and Appearance for Mobile Scene Analysis, International Conference on Computer Vision (ICCV'07), 2007.

A. Ess, A Mobile Vision System for Robust Multi-Person Tracking, IEEE Conference on Computer Vision and Pattern Recognition (CVPR'08), 2008.

G. Fant, Acoustic Theory of Speech Production. The Hague: Mouton, 1960.

M. Farenzena, Person re-identification by symmetry-driven accumulation of local features, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2360-2367, 2010.

T. Fortmann, Y. Bar-Shalom, and M. Scheffe, Sonar tracking of multiple targets using joint probabilistic data association, IEEE Journal of Oceanic Engineering, vol.8, issue.3, pp.173-184, 1983.

P. F. Felzenszwalb, Object Detection with Discriminatively Trained Part-Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.32, pp.1627-1645, 2010.

K. Fukunaga and T. E. Flick, An Optimal Global Nearest Neighbor Metric, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.PAMI-6, issue.3, pp.314-318, 1984.

R. A. Fisher, Statistical Methods For Research Workers. Cosmo study guides, 1925.

P.-E. Forssén, Maximally Stable Colour Regions for Recognition and Matching, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2007.

S. Furui, Cepstral analysis technique for automatic speaker verification, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.29, pp.254-272, 1981.

T. H. Falk, C. Zheng, and W. Y. Chan, A Non-Intrusive Quality and Intelligibility Measure of Reverberant and Dereverberated Speech, IEEE Transactions on Audio, Speech, and Language Processing, vol.18, pp.1766-1774, 2010.

S. Galliano, The ESTER phase II evaluation campaign for the rich transcription of French broadcast news, Proceedings of the 9th European Conference on Speech Communication and Technology (INTERSPEECH '05), pp.1149-1152, 2005.

N. D. Gaubitch, Performance Comparison of Algorithms for Blind Reverberation Time Estimation from Speech, IWAENC 2012; International Workshop on Acoustic Signal Enhancement, pp.1-4, 2012.

D. Gray, S. Brennan, and H. Tao, Evaluating appearance models for recognition, reacquisition, and tracking, IEEE International Workshop on Performance Evaluation for Tracking and Surveillance, 2007.

S. Geman and D. Geman, Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.PAMI-6, issue.6, pp.721-741, 1984.

Contexts of Accommodation: Developments in Applied Sociolinguistics. Studies in Emotion and Social Interaction, 1991.

R. B. Girshick, Rich feature hierarchies for accurate object detection and semantic segmentation, 2013.

R. B. Girshick, Fast R-CNN, 2015.

N. Gheissari, T. B. Sebastian, and R. Hartley, Person Reidentification Using Spatiotemporal Appearance, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), pp.1528-1535, 2006.

O. Hamdoun, Person re-identification in multi-camera system by signature based on interest point descriptors collected on short video sequences, Second ACM/IEEE International Conference on Distributed Smart Cameras, pp.1-6, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00332032

W. K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika, vol.57, pp.97-109, 1970.

J. Haton, Reconnaissance automatique de la parole : du signal à son interprétation, UniverSciences, p.392, 2006.

H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, The Journal of the Acoustical Society of America, vol.87, pp.1738-1752, 1990.

H. Hermansky and N. Morgan, RASTA processing of speech, IEEE Transactions on Speech and Audio Processing, vol.2, pp.578-589, 1994.

H. Hotelling, Relations Between Two Sets of Variates, pp.321-377

T. Houtgast and H. J. Steeneken, A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria, The Journal of the Acoustical Society of America, vol.77, pp.1069-1077, 1985.

D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, Canonical Correlation Analysis: An Overview with Application to Learning Methods, Neural Computation, vol.16, pp.2639-2664, 2004.

R. E. Kalman, A New Approach to Linear Filtering And Prediction Problems, ASME Journal of Basic Engineering, 1960.

P. Kenny, Factor analysis simplified, ICASSP, 2005.

E. Khoury, L. El Shafey, and S. Marcel, Spear: An open source toolbox for speaker recognition based on Bob, IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2014.

M. Köstinger, Large scale metric learning from equivalence constraints, 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp.2288-2295, 2012.

L. F. Lamel, BREF, a Large Vocabulary Spoken Corpus for French, pp.505-508

A. Larcher, ALIZE 3.0 - open source toolkit for state-of-the-art speaker recognition, pp.2768-2772, 2013.
URL : https://hal.archives-ouvertes.fr/hal-01927586

K. A. Lee, The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation, 2016.

Y. Lei, A novel scheme for speaker recognition using a phonetically-aware deep neural network, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.1695-1699, 2014.

G. Lisanti, Person Re-Identification by Iterative Re-Weighted Sparse Ranking, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.37, pp.1629-1642, 2015.

L. Lu, H. Jiang, and H. Zhang, A Robust Audio Classification and Segmentation Method, Proceedings of the Ninth ACM International Conference on Multimedia. MULTIMEDIA '01, pp.203-211, 2001.

A. Larcher, K. A. Lee, and S. Meignier, An extensible speaker identification sidekit in Python, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.5095-5099, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01433157

G. Lathoud, J. Odobez, and D. Gatica-Perez, in Machine Learning for Multimodal Interaction: First International Workshop, MLMI, Revised Selected Papers, edited by Samy Bengio and Hervé Bourlard, pp.182-195, 2004.

N. Martinel, Re-Identification in the Function Space of Feature Warps, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.37, pp.1656-1669, 2015.

A. Martin, The DET curve in assessment of detection task performance, pp.1895-1898, 1997.

N. Metropolis, Equation of State Calculations by Fast Computing Machines, The Journal of Chemical Physics, vol.21, pp.1087-1092, 1953.

J. Munkres, Algorithms for the Assignment and Transportation Problems, 1957.

E. Nemer, R. Goubran, and S. Mahmoud, Robust voice activity detection using higher-order statistics in the LPC residual domain, IEEE Trans. Speech and Audio Processing, vol.9, pp.217-231, 2001.

S. Oh, Markov Chain Monte Carlo Data Association for Multi-Target Tracking, technical report, 2008.

S. J. Prince and J. H. Elder, Probabilistic Linear Discriminant Analysis for Inferences About Identity, 2007 IEEE 11th International Conference on Computer Vision, pp.1-8, 2007.

D. Povey, The Kaldi Speech Recognition Toolkit, IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Catalog No.: CFP11SRW-USB, 2011.

B. Prosser, Person Re-Identification by Support Vector Ranking, Proc. BMVC, pp.21-22, 2010.

J. Pelecanos and S. Sridharan, Feature Warping for Robust Speaker Verification, 2001.

J. Pinquier, C. Sénac, and R. André-Obrecht, Robust speech / music classification in audio documents, Proc. ICSLP'02, 2002.
URL : https://hal.archives-ouvertes.fr/hal-01695739

R. Ratnam, Blind estimation of reverberation time, The Journal of the Acoustical Society of America, vol.114, p.2877, 2003.

D. W. Robinson and R. S. Dadson, A re-determination of the equal-loudness relations for pure tones, British Journal of Applied Physics, vol.7, issue.5, p.166, 1956.

D. Reid, An algorithm for tracking multiple targets, IEEE Transactions on Automatic Control, vol.24, pp.843-854, 1979.

S. Ren, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015.

D. A. Reynolds, Speaker Identification and Verification Using Gaussian Mixture Speaker Models, Speech Commun., vol.17, issue.2, pp.91-108, 1995.

B. D. Rao and K. V. Hari, Performance analysis of Root-Music, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.37, pp.1939-1949, 1989.

L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, 1993.

R. Ratnam, D. L. Jones, and W. D. O'Brien, Fast algorithms for blind estimation of reverberation time, IEEE Signal Processing Letters, vol.11, issue.6, pp.537-540, 2004.

R. Roy, A. Paulraj, and T. Kailath, ESPRIT - A subspace rotation approach to estimation of parameters of cisoids in noise, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.34, pp.1340-1342, 1986.

D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, Speaker Verification Using Adapted Gaussian Mixture Models, Digit. Signal Process., vol.10, issue.1, pp.19-41, 2000.

M. E. Sargin, Audiovisual Synchronization and Fusion Using Canonical Correlation Analysis, IEEE Transactions on Multimedia, vol.9, issue.7, pp.1396-1403, 2007.

R. Satta, Appearance Descriptors for Person Re-identification: a Comprehensive Review, 2013.

F. de Saussure, Cours de linguistique générale, 1916.

M. R. Schroeder, New Method of Measuring Reverberation Time, The Journal of the Acoustical Society of America, vol.37, pp.409-412, 1965.

R. Schmidt, Multiple emitter location and signal parameter estimation, IEEE Transactions on Antennas and Propagation, vol.34, issue.3, pp.276-280, 1986.

W. R. Schwartz and L. S. Davis, Learning Discriminative Appearance-Based Models Using Partial Least Squares, 2009 XXII Brazilian Symposium on Computer Graphics and Image Processing, pp.322-329, 2009.

S. J. Julier and J. K. Uhlmann, New extension of the Kalman filter to nonlinear systems, 1997.

J. Sohn, N. S. Kim, and W. Sung, A statistical model-based voice activity detection, IEEE Signal Processing Letters, vol.6, pp.1-3, 1999.

C. Sanderson and B. C. Lovell, Multi-Region Probabilistic Histograms for Robust and Scalable Identity Inference, Advances in Biometrics: Third International Conference, pp.199-208, 2009.

A. Saxena and A. Y. Ng, Learning Sound Location from a Single Microphone, Proceedings of the 2009 IEEE International Conference on Robotics and Automation. ICRA'09, pp.4310-4315, 2009.

S. O. Sadjadi, M. Slaney, and L. Heck, MSR Identity Toolbox v1.0: A MATLAB Toolbox for Speaker Recognition Research, technical report, 2013.

S. Tong, H. Gu, and K. Yu, A comparative study of robustness of deep learning approaches for VAD, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.5695-5699, 2016.

World urbanization prospects, 2014.

O. Viikki and K. Laurila, Cepstral Domain Segmental Feature Vector Normalization for Noise Robust Speech Recognition, Speech Commun., vol.25, pp.133-147

J. Y. Wen, E. A. Habets, and P. A. Naylor, Blind estimation of reverberation time based on the distribution of signal decay rates, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.329-332, 2008.

K. Q. Weinberger and L. K. Saul, Distance Metric Learning for Large Margin Nearest Neighbor Classification, J. Mach. Learn. Res., vol.10, pp.207-244, 2009.

J. Wu and X. L. Zhang, Efficient Multiple Kernel Support Vector Machine Based Voice Activity Detection, IEEE Signal Processing Letters, vol.18, pp.466-469, 2011.

W. S. Zheng, S. Gong, and T. Xiang, Person re-identification by probabilistic relative distance comparison, CVPR 2011, pp.649-656, 2011.

S. Zhang, How Far are We from Solving Pedestrian Detection?, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1259-1267, 2016.

Z. Zivkovic, Improved adaptive Gaussian mixture model for background subtraction, Proceedings of the 17th International Conference on Pattern Recognition, vol.2, pp.28-31, 2004.

L. Zheng, Y. Yang, and A. G. Hauptmann, Person Re-identification: Past, Present and Future, 2016.