I work on semantic indexing for multimedia documents, with applications to multimedia data mining and multimedia information retrieval. The main difficulty in reaching reasonable performance for concept detection in multimedia documents comes from the "semantic gap" between the raw multimedia content and the elements that make sense to human beings.
Standard approaches rely on a generic model that first extracts low-level features and then trains a classifier on the resulting descriptors to learn a set of concepts. In multimedia documents, descriptors are extracted from each modality and are inherently different. Descriptors can also be extracted at various levels of abstraction and granularity. Hence, efficient semantic indexing models require a fusion process to take advantage of these multimodal properties. Rather than designing the architecture of a system around these three processes (extraction, classification, fusion), I proposed in previous work a model called "networks of operators", in which the idea is to bridge the semantic gap in small steps. This model assumes that the correlation across each step is significantly higher than the correlation between the inputs and outputs of the overall system. Each operator then aims to fill a portion of the semantic gap by moving up one level of abstraction. Examples of operators include extractors, segmentation processes, classifiers (supervised or unsupervised), context modules, and fusion modules. In this network, the data flowing between operators are called "numcepts"; they unify all the pieces of information extracted at various abstraction levels and from various modalities. They include numbers (extracted from the signal) and concepts (at the most semantic level). Basically, numcepts are data arrays, whatever interpretation a human may give them. As a result, any kinds of numcepts can be merged. The fusion problem is therefore not restricted to the fusion of modalities; it becomes a problem of numcept combination. As instances of this model, I proposed operators for context and fusion modeling, and intermediate numcepts from the visual and textual modalities, called topic numcepts.
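To make the idea concrete, here is a minimal sketch of a network of operators in Python. It is only an illustration under simplifying assumptions, not the implementation from my previous work: numcepts are represented as plain lists of floats, every operator name (`visual_extractor`, `textual_extractor`, `concat_fusion`, `threshold_classifier`) and the helper `run_network` are hypothetical, and the operators themselves are toy functions.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a numcept is just a data array, whatever its interpretation.
Numcept = List[float]
Operator = Callable[[List[Numcept]], Numcept]


def visual_extractor(inputs: List[Numcept]) -> Numcept:
    """Toy low-level extractor: mean and max of a raw visual signal."""
    signal = inputs[0]
    return [sum(signal) / len(signal), max(signal)]


def textual_extractor(inputs: List[Numcept]) -> Numcept:
    """Toy extractor for a second modality: range of the raw values."""
    signal = inputs[0]
    return [max(signal) - min(signal)]


def concat_fusion(inputs: List[Numcept]) -> Numcept:
    """Fusion seen as numcept combination: concatenate the input arrays."""
    return [value for numcept in inputs for value in numcept]


def threshold_classifier(inputs: List[Numcept]) -> Numcept:
    """Toy classifier: outputs 1.0 if the mean feature value exceeds 0.5."""
    features = inputs[0]
    return [1.0 if sum(features) / len(features) > 0.5 else 0.0]


def run_network(
    network: Dict[str, Tuple[Operator, List[str]]],
    sources: Dict[str, Numcept],
) -> Dict[str, Numcept]:
    """Evaluate operators in declaration order (assumed topologically sorted)."""
    numcepts = dict(sources)
    for name, (operator, input_names) in network.items():
        numcepts[name] = operator([numcepts[n] for n in input_names])
    return numcepts


# Each entry maps an output numcept to (operator, names of its input numcepts).
network = {
    "visual": (visual_extractor, ["pixels"]),
    "textual": (textual_extractor, ["words"]),
    "fused": (concat_fusion, ["visual", "textual"]),
    "concept": (threshold_classifier, ["fused"]),
}
sources = {"pixels": [0.2, 0.8, 0.5], "words": [0.1, 0.9]}
result = run_network(network, sources)
print(result["concept"])
```

Because every operator consumes and produces the same kind of array, the fusion operator does not need to know which modality or abstraction level its inputs come from.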
Despite the benefit of generic approaches, which make it fairly easy to index multimedia documents with a large number of concepts, the quality of the results is still quite poor: Mean Average Precision is in the 0.1-0.2 range for the world's best systems in the TRECVID evaluation campaigns. On the other hand, specific approaches are designed for a particular concept and cannot be deployed to detect many concepts. This project aims to study hybrid generic/specific approaches in order to get the best of both. The expected result is an indexing system with "good" performance that is able to detect "a lot" of concepts.
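For readers unfamiliar with the measure, the following sketch shows how Mean Average Precision can be computed from ranked result lists. It is a simplified textbook version, not the official TRECVID scoring procedure: it assumes every relevant shot appears somewhere in the ranked list, and the concept names and relevance judgments are made up for illustration.

```python
def average_precision(ranked_relevance):
    """Average of precision values measured at the rank of each relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-concept average precision values."""
    return sum(average_precision(r) for r in rankings.values()) / len(rankings)


# Hypothetical judgments: True means the shot at that rank contains the concept.
rankings = {
    "boat": [True, False, True, False],
    "speech": [False, True, False, False],
}
map_score = mean_average_precision(rankings)
print(round(map_score, 3))
```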
Some first work in this direction has been done by participants in the TRECVID evaluation campaigns. With such techniques, it is quite easy to determine the optimal fusion method for a given concept, the optimal machine learning method for a given concept, and so on. But focusing on a single operator (e.g. a classifier or a fusion method) is not enough. My current research aims to build a hybrid generic/specific indexing model that allows optimisation over many operators.
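As an illustration of per-concept optimisation of a single operator, the sketch below picks, for each concept, the fusion method with the best average precision on a validation set. Everything in it is an assumption made for the example: the two candidate fusion methods, the `best_fusion_per_concept` helper, and the scores and labels are hypothetical and do not come from any TRECVID system.

```python
def average_precision(scores, labels):
    """Average precision of the ranking obtained by sorting on descending score."""
    ranked = [
        label
        for _, label in sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    ]
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Candidate fusion operators combining a visual score and a textual score.
FUSIONS = {
    "mean": lambda v, t: (v + t) / 2,
    "max": lambda v, t: max(v, t),
}


def best_fusion_per_concept(validation):
    """Return, for each concept, the fusion method with the highest validation AP."""
    choice = {}
    for concept, (visual, textual, labels) in validation.items():
        ap_by_method = {
            name: average_precision(
                [fuse(v, t) for v, t in zip(visual, textual)], labels
            )
            for name, fuse in FUSIONS.items()
        }
        choice[concept] = max(ap_by_method, key=ap_by_method.get)
    return choice


# Hypothetical validation data: (visual scores, textual scores, relevance labels).
validation = {
    "boat": ([0.9, 0.2, 0.6, 0.1], [0.1, 0.7, 0.5, 0.2], [1, 0, 1, 0]),
    "speech": ([0.9, 0.5, 0.0, 0.1], [0.0, 0.5, 0.8, 0.2], [1, 0, 1, 0]),
}
best = best_fusion_per_concept(validation)
print(best)
```

The same selection loop could be repeated for classifiers or extractors, but tuning each operator in isolation ignores their interactions, which is why optimising over many operators at once is the harder and more interesting problem.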
You can download my thesis (in French) here.
| Course | Lectures | Tutorials/Labs |
|---|---|---|
| Semantic Multimedia Indexing and Retrieval | Part 1 Part 2 | Lab 1 Images-dev.tgz Images-test.tgz Project |
| Mobile Apps | Part 1 Part 2 Part 3 Part 4 Part 5 Exam | |
| XML | Part 1 Part 2 Part 3 | Lab 1 Lab 2 Lab 3 Lab 1 solution Lab 2 solution Lab 3 solution |
| IP Telephony | Part 1 | Part 2 |


