Łukasz Dębowski
Quantitative Considerations on Finding
the Shortest Descriptions for Meaningful
Symbolic Sequences
924
Abstract
The notes provide elements of a new quantitative theory for
unsupervised learning from pragmatic language communication. It is
argued that a suitable quantitative inference framework, free from
paradoxes, should be based on minimum description length (MDL)
interpreted as a simplified algorithmic complexity rather than on
classical frequentist probability. Furthermore, it is argued that the
recently observed non-extensivity of entropy in meaningful symbolic
sequences can arise if and only if the unsupervised acquisition of MDL
theories for these sequences produces infinite theories and the
unsupervised acquisition is optimal as well. Such a result rigorously
undermines the belief that a finite formal theory of natural language
could be constructed by hand by any experts. On the other hand,
unsupervised machine learning is pointed out as a feasible, and the
only right, way of implementing language competence in AIs. From
this perspective, a promising compression-learning algorithm by de
Marcken, its efficiency and its extensions are discussed. Important
parallels with research in cognitive science and statistical physics
are pointed out as well. Thus, the notes may be of interest not only
for computer scientists and linguists but also for other statistical
and symbolic theorists.
Keywords:
unsupervised learning, natural language
processing, communication theory, quantitative linguistics,
nonextensive thermodynamics, measures of information, cognitive
science, formal languages.