File                Date        Author         Commit
classes             2014-12-17  Saturnino Luz  [ce8f5e] First commit (since refactored out of modnlp/).
lib                 2016-07-04  Saturnino Luz  [c149b2] Added compilation stanzas.
src                 2023-05-06  Nino Luz       [10746a] Fixed Metafacet cache name encoding bug; jitter...
.gitignore          2016-07-04  Saturnino Luz  [055117] Added missing symlinks. Corrected instructions ...
ChangeLog           2014-12-17  Saturnino Luz  [ce8f5e] First commit (since refactored out of modnlp/).
Makefile            2023-02-26  Nino Luz       [d05d1c] Fixed layout issues; generalised stpwords;
README              2018-02-01  Saturnino Luz  [2158d2] Fixed makefiles; added note on MI for Mosaic
mosaic.mf           2014-12-17  Saturnino Luz  [ce8f5e] First commit (since refactored out of modnlp/).
.keystore           2014-12-17  Saturnino Luz  [ce8f5e] First commit (since refactored out of modnlp/).
tecplugin.cer       2016-07-04  Saturnino Luz  [055117] Added missing symlinks. Corrected instructions ...
tecplugin.cer.pass  2016-07-04  Saturnino Luz  [055117] Added missing symlinks. Corrected instructions ...

Read Me

Concordance Mosaic plugin. Maybe use the autojar package to pack the
prefuse libraries into the plugin distribution.


-- Note on the relation between the Mosaic scaling metric (call it M)
and (pointwise) mutual information (I).

Mutual information is an information-theoretic quantity that
measures the degree of dependence between the values of two distinct
random variables, say W and K. It is often used in corpus linguistics
as a measure of collocation strength, and is defined as follows:

I(w;k) = log [ p(w,k) / (p(w)*p(k) ) ] = log [ p(w|k) / p(w) ]      (1)

If W=w is independent of K=k then p(w,k) = p(w)*p(k) and the mutual
information is log(1)=0.
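
As a quick, concrete illustration (a hypothetical, self-contained
snippet; class and method names are made up and not part of the
Mosaic sources), (1) can be computed directly from the probabilities:

  import static java.lang.Math.log;

  class PmiExample {
      // Pointwise mutual information, as in (1):
      // I(w;k) = log( p(w,k) / (p(w)*p(k)) ).
      static double pmi(double pwk, double pw, double pk) {
          return log(pwk / (pw * pk));
      }

      public static void main(String[] args) {
          // Dependent case: p(w,k) > p(w)*p(k), so I(w;k) > 0.
          System.out.println(pmi(0.02, 0.05, 0.1));  // log(4) ~ 1.386
          // Independent case: p(w,k) = p(w)*p(k), so I(w;k) = log(1) = 0.
          System.out.println(pmi(0.005, 0.05, 0.1)); // 0.0
      }
  }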

In collocation analysis we assume a probability model where W, K, etc.
are multinomial random variables ranging over vocabulary items. So,
abusing notation somewhat, we could write, for instance, W='the' and
K='end' for the event that the word 'the' occurs next to the keyword
'end'. We can estimate probabilities for such events by counting and
computing relative frequencies. So, the relative frequency of the
word 'the' could be written as p(W='the') = (number of occurrences of
the token 'the' in the corpus) / (total number of tokens in the
corpus) = #(the) / \sum_x #(x). The probability that word w occurs in
the context column c of word k could be written as

p(w|k) = (number of occurrences of w in column c)
         / (number of tokens in column c)
       = #(w,k) / \sum_x #(x,k).                                    (2)
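
For instance, with made-up counts (an illustrative sketch only; none
of the identifiers below come from the Mosaic sources), the
relative-frequency estimates above amount to:

  import java.util.Map;

  class EstimateExample {
      public static void main(String[] args) {
          // Made-up corpus counts #(x), and column counts #(x,k) for
          // the column c next to the keyword k = 'end'.
          Map<String, Integer> corpusCounts =
              Map.of("the", 120, "end", 10, "of", 60, "a", 80);
          Map<String, Integer> columnCounts =
              Map.of("the", 6, "of", 2, "a", 2);

          int corpusTotal = corpusCounts.values().stream()
                                        .mapToInt(Integer::intValue).sum();
          int columnTotal = columnCounts.values().stream()
                                        .mapToInt(Integer::intValue).sum();

          // p(w) = #(w) / \sum_x #(x)
          double pThe = (double) corpusCounts.get("the") / corpusTotal;
          // p(w|k) = #(w,k) / \sum_x #(x,k), as in (2)
          double pTheGivenEnd = (double) columnCounts.get("the") / columnTotal;

          System.out.printf("p(the) = %.3f, p(the|end) = %.3f%n",
                            pThe, pTheGivenEnd);
      }
  }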

The Mosaic metric (for the 'global', or 'within column' display)
is defined as follows:

M(w,k) = (#(w,k) / N) / (#(w) / \sum_x #(x))
       = (#(w,k) / N) / p(w)                                        (3)

where N is the number of types (distinct word forms) in column c.
Now, using (2), we can rewrite the numerator of (3) as

 #(w,k) / N = p(w|k) * \sum_x #(x,k) / N
            = p(w|k) * E[W|k]                                       (4)

where E[W|k] = \sum_x #(x,k) / N is the average number of tokens per
type in column c (that is, the expected count of a type drawn
uniformly at random from the column).

Substituting (4) into (3), we get

M(w,k) / E[W|k] = p(w|k) / p(w),  and therefore

I(w;k) = log (M(w,k)/E[W|k])                                        (5)
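
The identity in (5) can be checked numerically. The following
self-contained sketch (made-up counts and hypothetical names again,
independent of the plugin code) computes I(w;k) both ways:

  import java.util.Map;

  class MosaicMetricExample {
      public static void main(String[] args) {
          // Same made-up counts as in the earlier sketch.
          Map<String, Integer> corpusCounts =
              Map.of("the", 120, "end", 10, "of", 60, "a", 80);
          Map<String, Integer> columnCounts =
              Map.of("the", 6, "of", 2, "a", 2);

          int corpusTotal = corpusCounts.values().stream()
                                        .mapToInt(Integer::intValue).sum();
          int columnTotal = columnCounts.values().stream()
                                        .mapToInt(Integer::intValue).sum();
          int n = columnCounts.size();  // N: number of types in column c

          double pw  = (double) corpusCounts.get("the") / corpusTotal; // p(w)
          double pwk = (double) columnCounts.get("the") / columnTotal; // p(w|k)
          double m = ((double) columnCounts.get("the") / n) / pw;      // M(w,k), (3)
          double e = (double) columnTotal / n;                         // E[W|k]

          // Both lines should print the same value, as (5) predicts.
          System.out.println(Math.log(m / e));    // via the Mosaic metric
          System.out.println(Math.log(pwk / pw)); // via definition (1)
      }
  }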

