Read Me
Concordance Mosaic plugin. Consider using the autojar package to
pack the prefuse libraries into the plugin distribution.
-- Note on the relation between the Mosaic scaling metric (call it M)
and (pointwise) mutual information (I).
(Pointwise) mutual information is an information-theoretic quantity
that measures the dependence between values of two distinct random
variables, say W and K. It is often used in corpus linguistics as a
measure of collocation strength, and is defined as follows:
I(w;k) = log [ p(w,k) / (p(w)*p(k) ) ] = log [ p(w|k) / p(w) ] (1)
If W=w is independent of K=k then p(w,k) = p(w)*p(k) and the mutual
information is log(1)=0.
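As a toy illustration (not part of the plugin code), a minimal Python
sketch of (1) could look like this; the probabilities are made up:

    import math

    def pmi(p_wk, p_w, p_k):
        # Pointwise mutual information, equation (1).
        return math.log(p_wk / (p_w * p_k))

    print(pmi(0.002, 0.01, 0.05))   # dependent case: log(4) > 0
    print(pmi(0.0005, 0.01, 0.05))  # independent case: log(1) = 0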
In collocation analysis we assume a probability model where W, K,
etc. are multinomial random variables ranging over vocabulary items.
So, abusing notation somewhat, we could write, for instance, W='the'
and K='end' for the event that the word 'the' occurs next to the
keyword 'end'. We could estimate probabilities for such events by
counting and computing relative frequencies. So, the relative
frequency of the word 'the' could be written as p(W='the') = (number
of occurrences of the token 'the' in the corpus) / (total number of
tokens in the corpus) = #(the) / \sum_x #(x). The probability that
word w occurs in the context c of word k could be written as
p(w|k) = (number of occurrences of w in column c)
/ (number of tokens in column c)
= #(w,k) / \sum_x #(x,k).                                         (2)
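A small Python sketch of these relative-frequency estimates, assuming
we already have corpus-wide token counts and token counts for one
concordance column (the counts and names below are invented, purely
for illustration):

    from collections import Counter

    corpus_counts = Counter({'the': 500, 'of': 300, 'a': 200, 'end': 60})
    column_counts = Counter({'the': 40, 'of': 5, 'a': 5})  # one context column of 'end'

    def p_w(word):
        # p(w) = #(w) / \sum_x #(x): relative frequency in the whole corpus.
        return corpus_counts[word] / sum(corpus_counts.values())

    def p_w_given_k(word):
        # Equation (2): p(w|k) = #(w,k) / \sum_x #(x,k).
        return column_counts[word] / sum(column_counts.values())

    print(p_w('the'), p_w_given_k('the'))   # roughly 0.47 and 0.8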
The Mosaic metric (for the 'global', or 'within column' display)
is defined as follows:
M(w,k) = (#(w,k) / N) / (#(w) / \sum_x #(x))
       = (#(w,k) / N) / p(w)                                      (3)
where N is the number of types in column c. Now, from (2) we rewrite
the numerator of (3) as
#(w,k) / N = p(w|k) * \sum_x #(x,k) / N
= p(w|k) * E[W|k] (4)
where E[W|k] = \sum_x #(x,k) / N is the average
number of tokens per type in column c.
Substituting (4) into (3) gives M(w,k) = p(w|k) * E[W|k] / p(w),
hence M(w,k) / E[W|k] = p(w|k) / p(w), and therefore
I(w;k) = log( M(w,k) / E[W|k] )                                   (5)
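To make the relation concrete, here is an illustrative Python sketch
(again, not the plugin implementation) that computes M, E[W|k] and
the pointwise mutual information from toy counts and checks identity
(5) numerically:

    import math
    from collections import Counter

    corpus_counts = Counter({'the': 500, 'of': 300, 'a': 200, 'end': 60})
    column_counts = Counter({'the': 40, 'of': 5, 'a': 5})  # one concordance column

    N = len(column_counts)                     # number of types in the column
    col_tokens = sum(column_counts.values())   # \sum_x #(x,k)
    corpus_tokens = sum(corpus_counts.values())
    E = col_tokens / N                         # E[W|k], average tokens per type

    def mosaic(w):
        # Equation (3): M(w,k) = (#(w,k) / N) / p(w)
        return (column_counts[w] / N) / (corpus_counts[w] / corpus_tokens)

    def pmi(w):
        # Equation (1) with the estimates of (2): I(w;k) = log( p(w|k) / p(w) )
        return math.log((column_counts[w] / col_tokens) /
                        (corpus_counts[w] / corpus_tokens))

    for w in column_counts:
        assert abs(pmi(w) - math.log(mosaic(w) / E)) < 1e-9   # identity (5)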