productivity.measures {zipfR} | R Documentation |
Compute various measures of productivity and lexical richness from an observed frequency spectrum, a type-frequency list, an observed vocabulary growth curve, or a vector of tokens.
productivity.measures(obj, measures, ...)

## S3 method for class 'tfl'
productivity.measures(obj, measures, ...)

## S3 method for class 'spc'
productivity.measures(obj, measures, ...)

## S3 method for class 'vgc'
productivity.measures(obj, measures, ...)

## Default S3 method:
productivity.measures(obj, measures, ...)
obj |
a suitable data object from which productivity measures
can be computed. Currently either a frequency spectrum
(of class |
measures |
character vector naming the productivity measures to be computed (see "Productivity Measures" below). Names may be abbreviated as long as they remain unique. If unspecified, all supported measures are computed. |
... |
additional arguments passed on to the method implementations (currently, no further arguments are recognized) |
This function computes productivity measures based on an observed frequency spectrum, type-frequency list or vocabulary growth curve. If an expected spectrum or VGC is passed, the expectations E[V], E[V_m] will simply be substituted for the sample values V, V_m in the equations. In most cases, this does not yield the expected value of the productivity measure!
Some measures can only be computed from a complete frequency spectrum. They will return NA
if obj
is an incomplete spectrum or type-frequency list, an expected spectrum, or a vocabulary growth curve.
Some other measures can only be computed if a sufficient number of spectrum elements is included in a vocabulary growth curve (usually at least
V_1 and V_2), and will return NA
otherwise.
Such limitations are indicated in the list of measures below (unless spectrum elements V_1 and V_2 are sufficient).
For an expected frequency spectrum or vocabulary growth curve, accurate expectations can be computed for the measures R, C, P, TTR and V. For S, H and Hapax, the expectations are often reasonably good approximations (based on a normal approximation of the ratio V_m / V derived from Evert (2004b, Lemma A.8) using an (incorrect) independence assumption for V_m and V - V_m).
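The gap between substituting expectations and taking the expectation of a measure can be illustrated with a small simulation. The sketch below uses base R only and does not call zipfR. The Zipf-like population, its size, the sample size and the number of replicates are all assumptions chosen for illustration. It compares a Monte Carlo estimate of E[V_2 / V] with the plug-in value E[V_2] / E[V] for Sichel's S.

```r
set.seed(42)                        # make the sketch reproducible
n_types <- 1000                     # hypothetical number of types in the population
prob <- 1 / seq_len(n_types)        # Zipf-like weights (an assumption, not from zipfR)
prob <- prob / sum(prob)            # normalise to probabilities
N <- 500                            # tokens per random sample
n_rep <- 200                        # number of simulated samples

one_sample <- function() {
  # draw N tokens and count how often each type occurs
  f <- tabulate(sample.int(n_types, N, replace = TRUE, prob = prob),
                nbins = n_types)
  c(V = sum(f > 0), V2 = sum(f == 2))
}

sims <- replicate(n_rep, one_sample())    # 2 x n_rep matrix with rows "V" and "V2"

# Monte Carlo estimate of the true expectation E[V2 / V]
S_true <- mean(sims["V2", ] / sims["V", ])
# plug-in value: estimated E[V2] divided by estimated E[V]
S_plugin <- mean(sims["V2", ]) / mean(sims["V", ])

print(c(mean_of_ratio = S_true, ratio_of_means = S_plugin))
```

In a setup like this the two numbers are typically close but not identical, in line with the remark that the plug-in value for S is only an approximation. Both quantities here are themselves simulation estimates, so the size of the difference should not be over-interpreted.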
If obj
is a frequency spectrum, type-frequency list or token vector:
A numeric vector of the same length as measures
with the corresponding observed values of the productivity measures.
If obj
is a vocabulary growth curve:
A numeric matrix with columns corresponding to the selected productivity measures and rows corresponding to the sample sizes of the vocabulary growth curve.
The following productivity measures are currently supported:
K
:Yule's (1944) K = 10000 * (SUM(m) m^2 Vm - N) / N^2
(only for complete observed frequency spectrum)
D
:Simpson's (1949) D = SUM(m) Vm * (m / N) * ((m - 1) / (N - 1))
(only for complete observed frequency spectrum)
R
:Guiraud's (1954) R = V / sqrt(N)
S
:Sichel's (1975) S = V2 / V, i.e. the proportion of dis legomena
H
:Honoré's (1979) H = 100 * log(N) / (1 - V1 / V), a transformation of the proportion of hapax legomena adjusted for sample size
C
:Herdan's (1964) C = log(V) / log(N)
P
:Baayen's (1991) productivity index P = V1 / N, which corresponds to the slope of the vocabulary growth curve (under random sampling assumptions)
TTR
:the type-token ratio TTR = V / N
Hapax
:the proportion of hapax legomena V1 / V
V
:the total number of types V
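The formulas in this list can be checked by hand on a toy sample. The following sketch uses base R only; it is an illustration of the equations as written above, not the package's implementation, and the token vector is made up. Note that H is undefined when every type is a hapax (V1 = V), since the denominator becomes zero.

```r
# toy token vector (hypothetical data): a x3, b x2, e x2, c x1, d x1
tokens <- c("a", "b", "a", "c", "b", "a", "d", "e", "e")

freq <- table(tokens)               # type frequencies
N <- sum(freq)                      # sample size (number of tokens)
V <- length(freq)                   # vocabulary size (number of types)

spc <- table(as.integer(freq))      # frequency spectrum: number of types per frequency m
m  <- as.integer(names(spc))        # frequency classes that actually occur
Vm <- as.numeric(spc)               # V_m for those classes
V1 <- sum(Vm[m == 1])               # hapax legomena
V2 <- sum(Vm[m == 2])               # dis legomena

measures <- c(
  K     = 10000 * (sum(m^2 * Vm) - N) / N^2,
  D     = sum(Vm * (m / N) * ((m - 1) / (N - 1))),
  R     = V / sqrt(N),
  S     = V2 / V,
  H     = 100 * log(N) / (1 - V1 / V),
  C     = log(V) / log(N),
  P     = V1 / N,
  TTR   = V / N,
  Hapax = V1 / V,
  V     = V
)
print(round(measures, 4))
```

For this sample N = 9, V = 5, V1 = 2 and V2 = 2, so for instance P = 2/9 and S = 2/5.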
Evert, Stefan (2004b). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, IMS, University of Stuttgart. URN urn:nbn:de:bsz:93-opus-23714 http://elib.uni-stuttgart.de/opus/volltexte/2005/2371/
lnre.bootstrap
and bootstrap.confint
for parametric bootstrapping experiments,
which help to determine the true expectations and sampling distributions of all productivity measures.
## TODO
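A minimal usage sketch, assuming the zipfR package is installed. It relies only on what is described above: the default method accepts a vector of tokens, and measures selects which values are returned. The token vector is made-up toy data, and the guard with requireNamespace is added here so that the snippet does nothing if zipfR is unavailable.

```r
# toy token vector (hypothetical data)
toks <- c("a", "b", "a", "c", "b", "a", "d", "e", "e")

if (requireNamespace("zipfR", quietly = TRUE)) {
  library(zipfR)
  # default method: compute the selected measures directly from the tokens
  pm <- productivity.measures(toks, measures = c("V", "TTR", "P"))
  print(pm)
} else {
  pm <- NULL    # zipfR not installed; nothing to compute
}
```

According to the Value section, the result for a token vector should be a numeric vector with one entry per requested measure.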