Title: | Analysis of Dramatic Texts |
---|---|
Description: | Analysis of preprocessed dramatic texts, with respect to literary research. The package provides functions to analyze and visualize information about characters, stage directions, the dramatic structure and the text itself. The dramatic texts are expected to be in CSV format and can be installed from within the package; sample texts are provided. The package and the reasoning behind it are described in Reiter et al. (2017) <doi:10.18420/in2017_119>. |
Authors: | Nils Reiter [aut, cre] (0000-0003-3193-6170), Tim Strohmayer [ctb], Janis Pagel [ctb] |
Maintainer: | Nils Reiter <[email protected]> |
License: | GPL (>= 3) |
Version: | 3.0.2 |
Built: | 2024-11-06 06:13:53 UTC |
Source: | https://github.com/quadrama/dramaanalysis |
This function expects an object of type QDCharacterStatistics and plots the specified column as a stacked bar plot.
## S3 method for class 'QDCharacterStatistics'
barplot(height, col = qd.colors, column = "tokens", order = -1, labels = TRUE, top = 5, ...)
height |
The object of class QDCharacterStatistics that is to be plotted |
col |
The colors to use |
column |
Which column of the character statistics should be used? |
order |
Sort the fields inversely |
labels |
Whether to add character labels into the plot |
top |
Limit the labels to the top 5 characters. Otherwise, labels will become unreadable. |
... |
All remaining options are passed to barplot.default(). |
For the return value, see barplot.default().
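A minimal usage sketch (not part of the original manual), assuming the bundled rksp.0 sample data:
# compute per-character statistics, then plot the token counts of the top characters
data(rksp.0)
cs <- characterStatistics(rksp.0)
barplot(cs, column = "tokens", top = 5)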
A list of word fields, i.e., collections of German lemmas associated with the five concepts Familie (family), Krieg (war), Liebe (love), Ratio (reason) and Religion (religion). The base dictionary is for demo purposes only, because it doesn't contain any umlaut characters.
base_dictionary
A list with five entries, each of them being a character vector.
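A short inspection sketch (added for illustration), assuming the package is attached or addressed via its namespace:
# the five concepts and a peek into one word field
names(DramaAnalysis::base_dictionary)
head(DramaAnalysis::base_dictionary$Liebe)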
The function characterNames() is applicable to all tables with a character column (i.e., that are of the class QDHasCharacter). It can be used to reformat the character names. The function FUN is applied to the character name entries within the QDDrama object. The factor levels in the character column of x are replaced by the result values of FUN.
characterNames(x, drama, FUN = stringr::str_to_title, sort = 0, ...)
x |
The object in which we want to transform names; needs to inherit the class QDHasCharacter. |
drama |
The QDDrama object with all the information. |
FUN |
A function applied to the strings. Defaults to stringr::str_to_title(). |
sort |
Numeric. If set to a non-zero value, the resulting data.frame will be sorted alphabetically by drama and character name. If the value is above 0, the sorting is ascending; if set to a negative value, the sorting is descending. If sort is set to 0 (the default), the order is unchanged. The ordering can also be specified explicitly, by passing an integer vector with as many elements as x has rows. |
... |
All other arguments are ignored. |
The function returns x, but with modified character names.
See also: stringr::str_to_title().
data(rksp.0)
ustat <- utteranceStatistics(rksp.0)
ustat <- characterNames(ustat, rksp.0)
This function extracts character statistics from a drama object.
characterStatistics( drama, normalize = FALSE, segment = c("Drama", "Act", "Scene"), filterPunctuation = FALSE )
drama |
A QDDrama object. |
normalize |
Logical. Whether to normalize the individual columns. |
segment |
"Drama", "Act", or "Scene". Allows calculating statistics on segments of the play |
filterPunctuation |
Whether to exclude all punctuation from token counts |
A data frame with the additional classes QDCharacterStatistics and QDHasCharacter. It has the following columns and one row for each character:
tokens: The number of tokens spoken by that character
types: The number of different tokens (= types) spoken by each character
utterances: The number of utterances
utteranceLengthMean: The mean length of utterances
utteranceLengthSd: The standard deviation in utterance length
data(rksp.0)
stat <- characterStatistics(rksp.0)
The function combine(x, y)
can be used to merge
multiple objects of the type QDDrama
into one.
combine(x, y)
x |
A QDDrama object. |
y |
Another QDDrama object. |
A single QDDrama object that represents both plays.
data(rksp.0)
data(rjmw.0)
d <- combine(rjmw.0, rksp.0)
The function configuration(...) creates a drama configuration matrix as a QDConfiguration object, which is also a data.frame. The S3 function as.matrix() can be used to extract a numeric or logical matrix containing the core.
configuration(d, segment = c("Act", "Scene"), mode = c("Active", "Passive"), onlyPresence = FALSE)
## S3 method for class 'QDConfiguration'
as.matrix(x, ...)
d |
A QDDrama object. |
segment |
A character vector, either "Act" or "Scene". Partial matching allowed. |
mode |
Character vector, should be either "Active" or "Passive". Passive configurations express when characters are mentioned, active ones when they speak themselves. Please note that extracting passive configurations only makes sense if some form of coreference resolution has taken place on the text, either manually or automatically. If not, only very basic references (first person pronouns and proper names) are represented, which usually gives a very wrong impression. |
onlyPresence |
If TRUE, the function only records whether a character was present. If FALSE (which is the default), the function counts the number of tokens spoken (active) or referenced (passive). |
x |
An object of class QDConfiguration |
... |
All other arguments are passed to |
Drama configuration matrix as a QDConfiguration object (of type data.frame).
By default, we generate active matrices that are based on the character speech. A character is present in a scene or act if they make an utterance. Using the argument mode, we can also create passive configuration matrices. They look very similar, but are based on who is mentioned in a scene or an act.
# Active configuration matrix
data(rksp.0)
cfg <- configuration(rksp.0)
# Passive configuration matrix
cfg <- configuration(rksp.0, mode="Passive")
Calculates the correlation of a frequency table with an outcome list, according to the given method. The function currently only works for pairwise correlation, i.e., two categories. Note that the function keyness() is better suited for this task, and this function should no longer be used for it.
correlationAnalysis(text.ft, categories, method = "spearman", culling = 0, ...)
text.ft |
A matrix, containing words in columns and characters (or plays) in rows. This can be the result of the frequencytable() function. |
categories |
A factor or numeric vector that represents a list of categories. |
method |
The correlation method, passed on to cor() |
culling |
An integer. Words that appear in fewer items are removed. Defaults to 0, which doesn't remove anything. |
... |
Additional arguments, passed to cor(). |
The function returns a data.frame with three columns: the word, its correlation score, and the category it is correlated to. The latter is mainly for easier use of the results.
data(rksp.0)
ft <- frequencytable(rksp.0, byCharacter=TRUE)
g <- factor(c("m","m","m","m","f","m","m","m","f","m","m","f","m"))
rksp.0.cor <- correlationAnalysis(ft, g)
# to pre-filter by the total frequency of a word
ft <- frequencytable(rksp.0, byCharacter=TRUE)
ft <- ft[,colSums(ft) > 5]
correlationAnalysis(ft, g)
rksp.0 represents the data set exported from Lessing's Emilia Galotti; rjmw.0 is the one exported from Miss Sara Sampson (also written by Lessing). Please note that in both plays, special characters have been removed for technical reasons. The text is German, but all umlauts have been replaced by another character. This is only a restriction of the pre-packaged files.
rksp.0 rjmw.0
A list containing data.frames and data.tables.
An object of class QDDrama (inherits from list) of length 6.
These methods count the number of occurrences of the words in the dictionaries, across different speakers and/or segments. The function dictionaryStatistics() calculates statistics for dictionaries with multiple entries, dictionaryStatisticsSingle() only for a single word list. The S3 method as.matrix() extracts the numeric part of a QDDictionaryStatistics table as a matrix.
dictionaryStatistics(drama, fields = DramaAnalysis::base_dictionary[fieldnames], fieldnames = c("Liebe"), segment = c("Drama", "Act", "Scene"), normalizeByCharacter = FALSE, normalizeByField = FALSE, byCharacter = TRUE, column = "Token.lemma", ci = TRUE)
dictionaryStatisticsSingle(drama, wordfield = c(), segment = c("Drama", "Act", "Scene"), normalizeByCharacter = FALSE, normalizeByField = FALSE, byCharacter = TRUE, fieldNormalizer = length(wordfield), column = "Token.lemma", ci = TRUE, colnames = NULL)
## S3 method for class 'QDDictionaryStatistics'
as.matrix(x, ...)
drama |
A QDDrama object. |
fields |
A list of lists that contains the actual field names. By default, we load the base_dictionary. |
fieldnames |
A list of names for the dictionaries. |
segment |
The segment level that should be used. By default, the entire play will be used. Possible values are "Drama" (default), "Act" or "Scene". |
normalizeByCharacter |
Logical. Whether to normalize by character speech length. |
normalizeByField |
Logical. Whether to normalize by dictionary size. You usually want this. |
byCharacter |
Logical, defaults to TRUE. If false, values will be calculated for the entire segment (play, act, or scene), and not for individual characters. |
column |
The table column we apply the dictionary on. Should be either "Token.surface" or "Token.lemma", the latter is the default. |
ci |
Whether to ignore case. Defaults to TRUE, i.e., case is ignored. |
wordfield |
A character vector containing the words or lemmas to be counted (only for dictionaryStatisticsSingle()). |
fieldNormalizer |
Defaults to the length of the wordfield. If normalizeByField is given, the absolute numbers are divided by this number. |
colnames |
The column names to be used in the output table. |
x |
An object of the type QDDictionaryStatistics. |
... |
All other parameters are passed to |
A numeric matrix that contains the frequency with which a dictionary is present in a subset of tokens
# Check multiple dictionary entries
data(rksp.0)
dstat <- dictionaryStatistics(rksp.0, fieldnames=c("Krieg","Familie"))
# Check a single dictionary entry
data(rksp.0)
fstat <- dictionaryStatisticsSingle(rksp.0, wordfield=c("der"))
mat <- as.matrix(dictionaryStatistics(rksp.0, fieldnames=c("Krieg","Familie")))
Given a QDDrama object, this function generates a list of nicely formatted names, following the format string.
dramaNames(x, ids = NULL, formatString = "%A: %T (%DM)", orderBy = "drama")
x |
The QDDrama object |
ids |
If specified, should be a character vector of play ids (prefixed with corpus). Then the return value only contains the plays in the vector and in the order specified. |
formatString |
A character vector. Contains special symbols that are replaced by meta data entries about the plays. The following symbols can be used: %T (title of the play), %A (author name), %P (GND entry of the author, if known), %DR, %DM (the minimal date), %L (the language), %I (the id), %C (the corpus prefix). |
orderBy |
The meta data key that the final list will be ordered by |
Character vector of formatted drama names
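A usage sketch (not from the original manual), assuming the bundled rksp.0 sample data and using only symbols documented above:
data(rksp.0)
# author and title, e.g. for plot labels
dramaNames(rksp.0, formatString = "%A: %T")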
ensureSuffix
makes certain that a character vector ends in
a given suffix
ensureSuffix(x, suffix)
x |
The character vector |
suffix |
The suffix |
The input character vector with the desired suffix
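An illustrative sketch (not part of the original manual) of the expected behaviour:
ensureSuffix("fields", "/")   # expected to return "fields/"
ensureSuffix("fields/", "/")  # suffix already present, expected to stay "fields/"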
The function filterByDictionary()
can be used to filter a matrix as produced by
frequencytable()
by the words in the given dictionary (or dictionaries).
The function frequencytable()
generates a matrix of word frequencies
by drama, act or scene and/or by character. The output of this function can be fed to stylo.
filterByDictionary(ft, fields = DramaAnalysis::base_dictionary[fieldnames], fieldnames = c("Liebe"))
frequencytable(drama, acceptedPOS = postags$de$words, column = "Token.lemma", byCharacter = FALSE, sep = "|", normalize = FALSE, sortResult = FALSE, segment = c("Drama", "Act", "Scene"))
ft |
A matrix as produced by frequencytable(). |
fields |
A list of lists that contains the actual field names. By default, we load the base_dictionary (as in dictionaryStatistics()). |
fieldnames |
A list of names for the dictionaries. |
drama |
A |
acceptedPOS |
A list of accepted POS tags. Words of all POS tags not in this list are filtered out. Specify NULL or an empty list to include all words. |
column |
The column name we should use (should be either Token.surface or Token.lemma) |
byCharacter |
Logical. Whether the count is by character or by text. |
sep |
The separation symbol that goes between drama name and character (if applicable). Defaults to the pipe symbol. |
normalize |
Whether to normalize values or not. If set to TRUE, the values are normalized by row sums. |
sortResult |
Logical. If true, the columns with the highest sum are ordered left (i.e., frequent words are visible first). If false, the columns are ordered alphabetically by column name. |
segment |
Character vector. Whether the count is by drama (default), act or scene |
Matrix of word frequencies, with segments (or characters) as rows and words as columns.
See also: stylo.
data(rksp.0)
ftable <- frequencytable(rksp.0, byCharacter = TRUE)
filtered <- filterByDictionary(ftable, fieldnames=c("Krieg", "Familie"))
data(rksp.0)
st <- frequencytable(rksp.0)
This function can be used to filter characters from all tables that contain a character column (and are of the class QDHasCharacter).
filterCharacters( hasCharacter, drama, by = c("rank", "tokens", "name"), n = ifelse(by == "tokens", 500, ifelse(by == "rank", 10, c())) )
hasCharacter |
The object we want to filter. |
drama |
The QDDrama object. |
by |
Character vector. Specifies the filter mechanism. |
n |
The threshold or a list of character names/ids to keep. |
The function supports three filter mechanisms: The filter by rank sorts the characters according to the number of tokens they speak and keeps the top n characters. The filter called tokens keeps all characters that speak n or more tokens. The filter called name keeps the characters whose names are provided as a vector in n.
The filtered QDHasCharacter object
data(rjmw.0)
dstat <- dictionaryStatistics(rjmw.0)
filterCharacters(dstat, rjmw.0, by="tokens", n=1000)
Function to download collection data (grouped texts) from GitHub. Overwrites (!) the current collections.
installCollectionData( dataDirectory = getOption("qd.datadir"), branchOrCommit = "master", repository = "metadata", baseUrl = "https://github.com/quadrama/" )
dataDirectory |
The data directory in which collection and data files are stored |
branchOrCommit |
The git branch, commit id, or tag that we want to download |
repository |
The repository |
baseUrl |
The GitHub user (or group) |
NULL
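A usage sketch (not from the original manual); note that this downloads files and overwrites existing collections in the data directory:
## Not run:
installCollectionData()
## End(Not run)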
This function downloads pre-processed dramatic texts via HTTP and stores them locally in your data directory.
installData( dataSource = "tg", dataDirectory = getOption("qd.datadir"), downloadSource = "ims", removeZipFile = TRUE, baseUrl = "https://github.com/quadrama", remoteUrl = paste0(baseUrl, "/data_", dataSource, ".git") )
dataSource |
Currently, only "tg" (TextGrid) is supported |
dataDirectory |
The directory in which the data is to be stored |
downloadSource |
No longer used. |
removeZipFile |
No longer used. |
baseUrl |
The remote repository owner (e.g., https://github.com/quadrama) |
remoteUrl |
The URL of the remote repository. |
NULL
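A usage sketch (not from the original manual), using the default TextGrid-based data source:
## Not run:
installData(dataSource = "tg")
## End(Not run)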
isolateCharacterSpeech()
isolates the speeches
of individual characters and optionally saves them in separate text files.
isolateCharacterSpeech( drama, segment = c("Drama", "Act", "Scene"), minTokenCount = 0, countPunctuation = TRUE, writeToFiles = TRUE, dir = getOption("qd.datadir") )
drama |
A text (or multiple texts, as a QDDrama object) |
segment |
"Drama", "Act", or "Scene". Determines on what segment-level the speech is isolated. |
minTokenCount |
The minimal token count for a speech to be considered (default = 0) |
countPunctuation |
Whether to include punctuation in minTokenCount (default = TRUE) |
writeToFiles |
Whether to write each isolated speech into a new text file (default = TRUE) |
dir |
The directory into which the files will be written (default = data directory) |
A named list of character vectors, each corresponding to character speeches as defined by segment
data(rksp.0)
isolateCharacterSpeech(rksp.0, segment="Scene", writeToFiles=FALSE)
Given a frequency table (with texts as rows and words as columns),
this function calculates log-likelihood and log ratio of one set of rows against the other rows.
The return value is a list containing scores for each word. If the method is loglikelihood, the returned scores are unsigned G2 values. To estimate the direction of the keyness, the log ratio is more informative. A good introduction to log ratio is available online.
keyness( ft, categories = c(1, rep(2, nrow(ft) - 1)), epsilon = 1e-100, siglevel = 0.05, method = c("loglikelihood", "logratio"), minimalFrequency = 10 )
ft |
The frequency table |
categories |
A factor or numeric vector that represents an assignment of categories. |
epsilon |
null values are replaced by this value, in order to avoid division by zero |
siglevel |
Return only the keywords above the significance level. Set to 1 to get all words |
method |
Either "logratio" or "loglikelihood" (default) |
minimalFrequency |
Words less frequent than this value are not considered at all |
A list of keywords, sorted by their log-likelihood or log ratio value, calculated according to http://ucrel.lancs.ac.uk/llwizard.html.
data("rksp.0") ft <- frequencytable(rksp.0, byCharacter = TRUE, normalize = FALSE) # Calculate log ratio for all words genders <- factor(c("m", "m", "m", "m", "f", "m", "m", "m", "f", "m", "m", "f", "m")) keywords <- keyness(ft, method = "logratio", categories = genders, minimalFrequency = 5) # Remove words that are not significantly different keywords <- keywords[names(keywords) %in% names(keyness(ft, siglevel = 0.01))]
data("rksp.0") ft <- frequencytable(rksp.0, byCharacter = TRUE, normalize = FALSE) # Calculate log ratio for all words genders <- factor(c("m", "m", "m", "m", "f", "m", "m", "m", "f", "m", "m", "f", "m")) keywords <- keyness(ft, method = "logratio", categories = genders, minimalFrequency = 5) # Remove words that are not significantly different keywords <- keywords[names(keywords) %in% names(keyness(ft, siglevel = 0.01))]
Returns a list of all ids that are installed
loadAllInstalledIds( asDataFrame = FALSE, dataDirectory = getOption("qd.datadir") )
asDataFrame |
Logical value. Controls whether the return value is a list (with colon-joined ids) or a data.frame with two columns (corpus, drama) |
dataDirectory |
The directory in which precompiled drama data is installed |
A character vector with all installed play ids
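A usage sketch (not from the original manual); the output depends on which corpora have been installed locally:
## Not run:
loadAllInstalledIds()
loadAllInstalledIds(asDataFrame = TRUE)
## End(Not run)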
Loads a table of characters and meta data
loadCharacters( ids, defaultCollection = "tg", dataDirectory = getOption("qd.datadir") )
ids |
a list or vector of ids |
defaultCollection |
the default collection |
dataDirectory |
the data directory |
A data.frame extracted from the CSV file about characters
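A usage sketch (not from the original manual), assuming the "test:rksp.0" play used elsewhere in this manual has been installed:
## Not run:
chars <- loadCharacters("test:rksp.0")
## End(Not run)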
This function loads one or more of the installed plays and
returns them as a QDDrama
object.
loadDrama(ids, defaultCollection = "qd")
ids |
A vector of ids. |
defaultCollection |
If the ids do not have a collection prefix, the defaultCollection prefix is applied. |
The function returns a QDDrama object. This is essentially a list of data.tables, covering the different aspects (utterances, segments, characters, ...). If multiple ids have been supplied as arguments, the tables contain the information of multiple plays.
# both are equivalent
## Not run:
installData("test")
d <- loadDrama(c("test:rksp.0", "test:rjmw.0"))
d <- loadDrama(c("rksp.0", "rjmw.0"), defaultCollection = "test")
## End(Not run)
This function parses and loads one or more dramas in raw TEI format.
loadDramaTEI(filename)
filename |
The filename of the drama to load (or a list thereof). |
The function returns an object of class QDDrama.
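A usage sketch (not from the original manual); the file path is purely hypothetical and stands for a local TEI file:
## Not run:
d <- loadDramaTEI("path/to/play.tei.xml")
## End(Not run)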
loadFields()
loads dictionaries that are available on the web as plain text files.
loadFields( fieldnames = c("Liebe", "Familie"), baseurl = paste("https://raw.githubusercontent.com/quadrama/metadata/master", ensureSuffix(directory, fileSep), sep = fileSep), directory = "fields/", fileSuffix = ".txt", fileSep = "/" )
fieldnames |
A list of names for the dictionaries. It is expected that files with that name can be found below the URL. |
baseurl |
The base path delivering the dictionaries. Should end in a /, field names will be appended and fed into read.csv(). |
directory |
The last component of the base url. Useful to retrieve enriched word fields from metadata repo. |
fileSuffix |
The suffix for the dictionary files |
fileSep |
The file separator used to construct the URL. Can be overwritten to load local dictionaries. |
A named list that holds the loaded dictionaries as character vectors.
Dictionary files should contain one word per line, with no comments or any other meta information. The entry name for the dictionary is given as the file name; it is therefore best if it does not contain special characters. The dictionary must be in UTF-8 encoding, and the file name needs to end in .txt.
# retrieves word fields from github
fields <- loadFields(fieldnames=c("Liebe", "Familie", "Krieg"))
Helper method to load metadata about dramatic texts (e.g., author, year). It does not load the texts, so it is much faster.
loadMeta(ids)
ids |
A vector or list of drama ids |
A data frame with the metadata.
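A usage sketch (not from the original manual), assuming the play ids from the loadDrama() example are installed:
## Not run:
meta <- loadMeta(c("test:rksp.0", "test:rjmw.0"))
## End(Not run)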
Function to load a set from collection files. It can optionally set the set name as a genre in the returned table. loadSets() returns a table of all defined collections (and the number of plays in each).
loadSet(setName, addGenreColumn = FALSE)
loadSets()
setName |
A character vector. The name of the set(s) to retrieve. |
addGenreColumn |
Logical. Whether to set the Genre column in the returned table to the set name. If set to FALSE (the default), a vector is returned; in this case, the association to collections is not returned. Otherwise, a data.frame is returned. |
A character vector with play ids that belong to the set.
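A usage sketch (not from the original manual); the set name "mySet" is hypothetical and must match an installed collection file:
## Not run:
loadSets()              # overview of all defined collections
ids <- loadSet("mySet") # hypothetical collection name
## End(Not run)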
Load Text
loadText( ids, includeTokens = FALSE, defaultCollection = "tg", unifyCharacterFactors = FALSE, variant = "UtterancesWithTokens" )
ids |
A vector containing drama ids to be downloaded |
includeTokens |
This argument no longer has any effect. Tokens are always included. |
defaultCollection |
The collection prefix is added if no prefix is found |
unifyCharacterFactors |
Logical value, defaults to FALSE. Controls whether columns representing characters (i.e., Speaker.* and Mentioned.*) share factor levels |
variant |
The file variant to load |
A data.frame that is also of class QDHasUtteranceBE.
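A usage sketch (not from the original manual), assuming the "test:rksp.0" play has been installed:
## Not run:
text <- loadText("test:rksp.0")
## End(Not run)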
This function can be used to replace corpus prefixes. If a list of play ids contains TextGrid prefixes, for instance, this function can be used to map them onto GerDraCor prefixes. Please note that the function does not check whether the play actually exists in the corpus.
mapPrefix(idList, map)
idList |
The list of ids in which we want to replace. |
map |
A list containing the old prefix as name and the new one as values. |
The function returns a list of the same length of the input list, but with replaced play prefixes.
# returns c("corpus2:play1", "corpus2:play2") mapPrefix(c("corpus1:play1", "corpus1:play2"), list(corpus1="corpus2"))
# returns c("corpus2:play1", "corpus2:play2") mapPrefix(c("corpus1:play1", "corpus1:play2"), list(corpus1="corpus2"))
newCollection()
can be used to create new collections
or add dramas to existing collection files.
newCollection( drama, name = ifelse(inherits(drama, "QDDrama"), paste(unique(drama$meta$drama)), paste(drama, collapse = "_")), writeToFile = TRUE, dir = getOption("qd.collectionDirectory"), append = TRUE )
drama |
A text (or multiple texts, as data.frame or data.table), or a character vector containing the drama IDs to be collected |
name |
The name of the collection and its filename (default = concatenated drama IDs) |
writeToFile |
Whether to write the collection to a file (default = TRUE) |
dir |
The directory into which the collection file will be written (default = collection directory) |
append |
Whether to extend the collection file if it already exists. If FALSE, the file will be overwritten. (default = TRUE) |
The function returns the ids that belong to the collection as a character vector.
t <- combine(rksp.0, rjmw.0)
newCollection(t, writeToFile=FALSE)
newCollection(c("rksp.0", "rjmw.0"), writeToFile=FALSE) # produces identical file
newCollection(c("a", "b"), name="rksp.0_rjmw.0", writeToFile=FALSE) # adds "a" and "b" to the file
The function numberOfPlays()
determines how many
different plays are contained in a single QDDrama object.
numberOfPlays(x)
x |
The QDDrama object |
An integer. The number of plays contained in the QDDrama object.
# returns 1
numberOfPlays(rksp.0)
# returns 2
numberOfPlays(combine(rksp.0, rjmw.0))
There are multiple ways to quantify the number of characters that are exchanged over a scene or act boundary.
hamming(drama, variant = c("Trilcke", "Hamming", "NormalizedHamming"))
scenicDifference(drama, norm = length(unique(drama$text$Speaker.figure_id)))
drama |
The QDDrama Object |
variant |
For hamming(): The variant to calculate, one of "Trilcke" (default), "Hamming", or "NormalizedHamming". |
norm |
For scenicDifference(): The normalization constant; defaults to the number of different characters in the play. |
A QDHamming object, which is a list of values, one for each scene change. The values indicate the (potentially) normalized number of characters that are exchanged.
data(rksp.0)
dist_trilcke <- hamming(rksp.0)
dist_hamming <- hamming(rksp.0, variant = "Hamming")
dist_nhamming <- hamming(rksp.0, variant = "NormalizedHamming")
Uses the default scatterplot function to plot the personnel exchange in each scene.
## S3 method for class 'QDHamming'
plot(x, drama = NULL, xlab = "Scene", ylab = "Exchange after Scene", ...)
x |
A numeric vector generated from the function hamming(). |
drama |
Optional QDDrama object. If present, act boundaries and correct scene labels are included in the plot. |
xlab |
A character vector that is used as x axis label. Defaults to "Scene". |
ylab |
A character vector that is used as y axis label. Defaults to "Exchange after Scene". |
... |
Parameters passed to plot.default(). |
For the return value, see plot.default().
data(rksp.0)
h <- hamming(rksp.0)
plot(h, drama=rksp.0)
Uses the function stripchart() to plot each utterance at its position, in a line representing the character. The dot is marked in the middle of each utterance. The plot might look odd if very long utterances are present.
## S3 method for class 'QDUtteranceStatistics'
plot(x, drama = NULL, colors = qd.colors, xlab = "Time", ...)
x |
A table generated from the function utteranceStatistics(). |
drama |
Optional QDDrama object. If present, segment boundaries are extracted from it and included in the plot. |
colors |
The colors to be used |
xlab |
A character vector that is used as x axis label. Defaults to "Time". |
... |
Parameters passed to stripchart(). |
For the return value, see stripchart().
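A minimal plotting sketch (not part of the original manual), assuming the bundled rksp.0 sample data:
data(rksp.0)
ustat <- utteranceStatistics(rksp.0)
plot(ustat, drama = rksp.0)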
Generates a spider-web-like plot. Spider webs may look cool, but they are terrible to interpret. You should consider using a bar chart to represent the same information. You have been warned.
plotSpiderWebs( dstat, symbols = c(17, 16, 15, 4, 8), cglcol = "black", legend = TRUE, legend.cex = 0.7, legend.pos.x = "bottomright", legend.pos.y = NA, legend.horizontal = FALSE, pcol = qd.colors, ... )
dstat |
A data frame containing data, e.g., output from dictionaryStatistics() |
symbols |
Symbols to be used in the plot |
cglcol |
The color for the spider net |
legend |
Whether to print a legend |
legend.cex |
Scaling factor for legend |
legend.pos.x |
X position of legend |
legend.pos.y |
Y position of legend |
legend.horizontal |
Whether to print legend horizontally or vertically |
pcol |
The line color(s) |
... |
Miscellaneous arguments to be given for radarchart(). |
No value is returned.
Radar charts and spider web plots are dangerous; they can easily become misleading. They are in this package for historic reasons, but should not be used anymore.
data(rksp.0)
fnames <- c("Krieg", "Liebe", "Familie", "Ratio","Religion")
ds <- dictionaryStatistics(rksp.0, normalizeByField=TRUE, fieldnames=fnames)
plotSpiderWebs(ds)
Provides lists of groups of POS tags for various word classes.
postags
An object of class list of length 1.
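A short inspection sketch (added for illustration); postags$de$words is the default acceptedPOS value used by frequencytable():
head(postags$de$words)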
This function should be called for a single text. It returns a data.frame with one row for each character in the play. The data.frame contains information about the number of scenes in which a character is actively speaking or passively mentioned. Please note that the information about passive presence is derived from coreference-resolved texts, which is a difficult task and not entirely reliable. The plays included in the package feature manually annotated coreferences (and thus, the presence is calculated on the basis of reliable data).
presence(drama, passiveOnlyWhenNotActive = TRUE)
drama |
A single drama |
passiveOnlyWhenNotActive |
Logical. If true (default), passive presence is only counted if a character is not actively present in the scene. |
QDHasCharacter, data.frame. The columns actives, passives and scenes show the absolute number of scenes in which a character is actively or passively present, and the total number of scenes in the play. The column presence is calculated as (actives - passives) / scenes.
data(rksp.0)
presence(rksp.0)
Color scheme to be used for QuaDramA plots. Taken from http://google.github.io/palette.js/ (tol-rainbow, 10 colors).
qd.colors
An object of class character of length 10.
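A small sketch (added for illustration) to preview the palette, assuming the package is attached:
barplot(rep(1, 10), col = qd.colors, border = NA, axes = FALSE)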
Generates a report for a specific dramatic text.
report( id = "test:rksp.0", of = file.path(getwd(), paste0(unlist(strsplit(id, ":", fixed = TRUE))[2], ".html")), type = c("Single", "Compare"), ... )
report( id = "test:rksp.0", of = file.path(getwd(), paste0(unlist(strsplit(id, ":", fixed = TRUE))[2], ".html")), type = c("Single", "Compare"), ... )
id |
The id of the text or a list of ids |
of |
The output file |
type |
The type of the report. "Single" gives a report about a single play, while "Compare" can be used to compare multiple editions of a play. Please note that the "Compare" report is still under development. |
... |
Arguments passed through to the rmarkdown document |
The return value of rmarkdown::render().
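A usage sketch (not from the original manual); rendering requires the corresponding play data to be installed:
## Not run:
report("test:rksp.0", type = "Single")
## End(Not run)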
This function takes two tables and combines them. The first table is of the class QDHasUtteranceBE and contains text spans that are designated with begin and end character positions. The second table of class QDHasSegments contains information about acts and scenes in the play. This function is used internally in many other functions, but is exported because it might become useful.
segment(hasUtteranceBE, hasSegments)
hasUtteranceBE |
Table with utterances |
hasSegments |
Table with segment info |
The function returns a data.table that has both the play segmentation and the token data in it.
data(rksp.0)
segmentedText <- segment(rksp.0$text, rksp.0$segments)
This function initializes the paths to data files.
setCollectionDirectory(collectionDirectory = file.path(getOption("qd.datadir"), "collections"))
setDirectories(dataDirectory = file.path(path.expand("~"), "QuaDramA", "Data2"), collectionDirectory = file.path(dataDirectory, "collections"))
setDataDirectory(dataDirectory = file.path(path.expand("~"), "QuaDramA", "Data2"))
collectionDirectory |
A path to the directory in which collections are stored. By default, the directory is called "collections" below the data directory. |
dataDirectory |
A path to the directory in which data and metadata are located. "~/QuaDramA/Data2" by default. |
The set*Directory() functions always return NULL.
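A configuration sketch (not from the original manual), using the documented default locations:
## Not run:
setDataDirectory(file.path(path.expand("~"), "QuaDramA", "Data2"))
setCollectionDirectory(file.path(getOption("qd.datadir"), "collections"))
## End(Not run)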
The function split(x)
expects an object of type QDDrama
and can
be used to split a QDDrama
object that consists of multiple dramas
into a list thereof. It is the counterpart to combine(x, y)
.
## S3 method for class 'QDDrama'
split(x, ...)
x |
The object of class QDDrama. |
... |
All other arguments are ignored. |
Returns a list of individual QDDrama objects, each containing one text.
data(rksp.0)
data(rjmw.0)
d <- combine(rjmw.0, rksp.0)
dlist <- split(d)
This function calculates a variant of TF-IDF.
The input is assumed to contain relative frequencies.
IDF is calculated as idf_t = log(N / (1 + n_t)), with N being the total number of documents (i.e., rows) and n_t the number of documents containing term t. We add one to the denominator to prevent the IDF of terms that appear in every document from becoming 0.
tfidf(ftable)
ftable |
A matrix, containing "documents" as rows and "terms" as columns. Values are assumed to be normalized by document, i.e., contain relative frequencies. |
A matrix containing TF*IDF values instead of relative frequencies.
data(rksp.0)
ftable <- frequencytable(rksp.0, byCharacter=TRUE, normalize=TRUE)
rksp.0.tfidf <- tfidf(ftable)
mat <- matrix(c(0.10,0.2, 0, 0, 0.2, 0, 0.1, 0.2, 0.1, 0.8, 0.4, 0.9), nrow=3, ncol=4)
mat2 <- tfidf(mat)
print(mat2)
This method calculates the length of each utterance, organized by character and drama.
utteranceStatistics(drama, normalizeByDramaLength = TRUE)
drama |
The dramatic text(s) |
normalizeByDramaLength |
Logical value. If true, the resulting values will be normalized by the length of the drama. |
Returns an object of class QDUtteranceStatistics, which is essentially a data.frame.
data(rksp.0)
ustat <- utteranceStatistics(rksp.0)
boxplot(ustat$utteranceLength ~ ustat$character, col=qd.colors[1:5], las=2, frame=FALSE)