Load VIHI annotation data — get_vihi

Clone the BLAB-private vihi_annotations repository into ~/BLAB_DATA once before using this function.
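As a quick sanity check before the first call, something like the following can confirm that the clone is where the function looks for it. This is a minimal sketch, not part of the package: the path is assembled from the description above and from the error message shown in the Examples, and the variable name repo_path is only illustrative.

```r
# Expected location of the cloned repository (taken from the description
# above and from the error message printed when the repository is missing).
repo_path <- file.path(path.expand("~"), "BLAB_DATA", "vihi_annotations")

# Report whether the clone is in place before calling get_vihi_annotations().
if (dir.exists(repo_path)) {
  message("Found vihi_annotations at ", repo_path)
} else {
  message("vihi_annotations not found at ", repo_path, "; clone it first.")
}
```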

Usage

get_vihi_annotations(
  version = NULL,
  subset = c("random", "everything", "VI+TD-VI"),
  table = c("annotations", "intervals", "merged", "all"),
  include_all_tier_types = FALSE,
  allow_annotation_errors = FALSE,
  include_pi = FALSE
)

Arguments

version

Version tag to check out.

subset

Which pre-defined subset of the data should be loaded?

'random' (the default) loads the annotations from the 15 randomly sampled intervals from all recordings in the corpus.
'VI+TD-VI' loads the annotations from the random and the top-5 high-volubility intervals from VI recordings and their TD matches.
'everything' loads all annotations from all tiers. Exercise caution with this option: the data will include incomplete and unchecked annotations.

table

Which table to return: annotations (the default) or intervals. If merged, returns the annotations table with the interval information merged in; intervals without annotations are not included. If all, returns a named list of both tables.

include_all_tier_types

Should all tier types be included in the output? If FALSE (the default), only tiers that are relevant to the subset are returned. For the 'random' and 'VI+TD-VI' subsets, the relevant tier types are: transcription, vcm, lex, mwu, xds. For the 'everything' subset, this parameter is ignored as all tier types are returned.

allow_annotation_errors

In case errors are found in the annotations, should the function throw an error (FALSE, the default) or add error_n columns to the annotations table? Use only as a way to inspect the errors, not as a way to ignore them.

include_pi

Should annotations marked as PI be included in the output? If FALSE (the default), they are filtered out.
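When allow_annotation_errors = TRUE is used to inspect problems, the rows that carry errors can be pulled out roughly as follows. This is a sketch on a toy table rather than real output: the column names error_1 and error_2, the error message, and the assumption that rows without errors hold NA in those columns are all hypothetical, so check names() on the actual table first.

```r
# Toy stand-in for an annotations table returned with
# allow_annotation_errors = TRUE. The error_1/error_2 column names and the
# message text are hypothetical, as is the use of NA for "no error".
annotations <- data.frame(
  participant = c("CHI", "FA1", "MA1"),
  error_1 = c(NA, "invalid xds value", NA),
  error_2 = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Find every column whose name starts with "error_".
error_cols <- grep("^error_", names(annotations), value = TRUE)

# Keep rows where at least one error column is filled in.
has_error <- rowSums(!is.na(annotations[error_cols])) > 0
rows_with_errors <- annotations[has_error, ]
```

The filtered rows are meant for inspection only; the underlying annotations should still be fixed rather than ignored.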

Value

A table or a list of tables depending on the table parameter.
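To make the 'merged' option concrete, the sketch below reproduces the described behaviour on two toy tables with base R. It is an illustration under stated assumptions, not the package's implementation: the key column interval_id and the column sampling_type are hypothetical names.

```r
# Toy annotations table: two annotations in interval 1, one in interval 2.
# interval_id is a hypothetical key column.
annotations <- data.frame(
  interval_id = c(1, 1, 2),
  participant = c("CHI", "FA1", "CHI"),
  stringsAsFactors = FALSE
)

# Toy intervals table: interval 3 has no annotations.
# sampling_type is a hypothetical column.
intervals <- data.frame(
  interval_id = c(1, 2, 3),
  sampling_type = c("random", "random", "high-volubility"),
  stringsAsFactors = FALSE
)

# Keep every annotation and attach its interval information.
# Interval 3 has no annotations, so it does not appear in the result.
merged <- merge(annotations, intervals, by = "interval_id", all.x = TRUE)
```

With table = 'all', the two tables would instead be returned side by side in a named list, as in the vitd$annotations and vitd$intervals calls in the Examples.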

Details

The speaker TIER is identified by the participant column. All other tiers are stored in their own columns.

Notes:

Annotations are checked for errors for the standard ACLEW tiers only. Interval-level checks aren't currently performed at all.
Annotations marked as PI are filtered out by default. Set include_pi = TRUE if you want them included.
The transcribed utterance can be empty (""). Normally, that means that a code interval has been segmented but not annotated, but there might be other stray empty utterance segments as well.
(relevant for non-speaker TIERs only) Currently, there is no way to tell whether an annotation is missing because it was not segmented or because it was segmented but not yet annotated: both are represented as NA. This will change in the future: a missing segment will still be NA, but a missing annotation will be "".
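The last two notes can be checked on a loaded table along these lines. This is a sketch on a toy table: participant comes from the description above, while the column names transcription and xds are assumptions based on the tier types listed under include_all_tier_types.

```r
# Toy stand-in for the annotations table. The transcription and xds column
# names are assumptions based on the tier types named in this page.
annotations <- data.frame(
  participant = c("CHI", "FA1", "FA1", "MA1"),
  transcription = c("ba ba", "", "look at that", "hi"),
  xds = c(NA, NA, "C", "A"),
  stringsAsFactors = FALSE
)

# Utterances that were segmented but left empty.
# %in% is used instead of == so that NA values never match.
empty_utterances <- annotations[annotations$transcription %in% "", ]

# Non-speaker tier values that are missing. At present NA covers both
# "not segmented" and "segmented but not yet annotated".
missing_xds <- annotations[is.na(annotations$xds), ]

nrow(empty_utterances)
nrow(missing_xds)
```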

Examples

vitd_annotations <- get_vihi_annotations(version='0.0.0.9006-dev.5',
                                         subset='VI+TD-VI')
#> Error in run_git_command(repo, "fetch --tags --prune --prune-tags"): Expected to find the "vihi_annotations" repository at the following location: /home/runner/BLAB_DATA/vihi_annotations. Please clone it.

vitd <- get_vihi_annotations(version='0.0.0.9006-dev.5', subset='VI+TD-VI',
                             table='all')
#> Error in run_git_command(repo, "fetch --tags --prune --prune-tags"): Expected to find the "vihi_annotations" repository at the following location: /home/runner/BLAB_DATA/vihi_annotations. Please clone it.
vitd$annotations %>% head()
#> Error: object 'vitd' not found
vitd$intervals %>% head()
#> Error: object 'vitd' not found