Abstract
This dissertation models natural image and language data with data-driven methods, with a focus on interpreting the emergent representations. Cognitive development and processing involve learning to handle input from the surrounding environment. Similarly, data-driven methods offer a flexible way to find exploratory views of data. Independent Component Analysis (ICA) is a proven unsupervised method, especially in the field of neural signal processing. Under the assumption of statistical independence, it can extract cognitively relevant source signals from seemingly garbled signal mixtures. The concept is closely related to sparse coding, which is neurobiologically efficient and offers one view of how sensory information is processed in the brain. In the analysis of small video segments, another statistical concept, temporal coherence, is applied, and the results are compared to those of ICA. The learned representations share major characteristics with those measured in early processing in the visual cortex. A unified model that combines sparseness, temporal coherence, and topological organization is introduced.
With similar methodological tools, the focus then shifts to natural language data, which is given only minimal preprocessing so that the methods remain language-independent. The meaning of words can be modeled with vector space models built from contextual co-occurrence information collected from a large corpus. In contrast to classical methods that rely on second-order statistics, ICA can reveal the underlying sparse structure and make the representation more interpretable. In addition to validating the applied unsupervised methodology, the experimental results indicate that the parametrization of the data has a very large effect on the learned representation. With the analysis tools developed, the learned structure is matched to syntactic and semantic features at different levels. For translated sentence pairs, the result is a multilingual representation of words. The increased sparsity of the learned representations is validated by further nonlinear thresholding. The findings can be used to build distributional models of words that better match semantic theories of word classes and of relationships among word meanings, for natural language processing tasks where more interpretability is desired.
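The co-occurrence-based vector space models mentioned in the abstract can be sketched in a few lines. This is only a toy illustration and not the dissertation's pipeline: the corpus, the window size of 2, and the names `cooccurrence_vectors` and `cosine` are all assumptions made for the example, and the ICA step applied to such vectors in the dissertation is not reproduced here.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus; a real study would use a large corpus.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
    "the sun is bright",
]

vectors = cooccurrence_vectors(toy_corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_sun = cosine(vectors["cat"], vectors["sun"])
print(f"cat~dog: {sim_cat_dog:.3f}, cat~sun: {sim_cat_sun:.3f}")
```

In this toy setting, "cat" ends up closer to "dog" than to "sun" because the two animal words share more contexts. Raw counts and cosine similarity use only second-order information. The abstract's point is that ICA applied to such context vectors can yield sparser, more interpretable components.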
| Translated title of the contribution | Esitysten kehkeytyminen luonnollisesta datasta |
|---|---|
| Original language | English |
| Qualification | Doctor's degree |
| Awarding Institution | |
| Supervisors/Advisors | |
| Publisher | |
| Print ISBNs | 978-952-60-7583-9 |
| Electronic ISBNs | 978-952-60-7582-2 |
| Publication status | Published - 2017 |
| MoE publication type | G5 Doctoral dissertation (article) |
Keywords
- lexical semantics
- vision
- language
- meaning
- computational modeling
- vector space models
- unsupervised learning
- language independence
- machine learning
Original title: Emergence of representations in natural data