Semantic Paths Specification

Version 2.2

DISCLAIMER: This documentation is a collection of theoretical documents developed in the context of a pilot project where testing and implementation never occurred. As such, although it is publicly released in both French and English, it does not constitute an official publication.

Dates

Created Date: 2021-10-15

Last Update: 2022-01-27

Abstract

The Semantic Paths Specification presents the semantic links between CIDOC CRM entities. These links form the patterns presented in the Target Model Specification and are intended to answer use cases relevant to the project.

Foreword

The Semantic Paths Specification (SPS) provides a granular view of the DOPHEDA model and is intended for those who want to develop a mapping process or SPARQL queries. It is meant to be used in conjunction with the Target Model Specification (TMS), which describes the relations between nodes through a set of patterns, although at the moment only entry nodes are documented.

Each node is defined by a precise scope note detailing the data values it can accommodate, as well as by its generated bond, semantic valuation, and controlled list/terms. The context of each node can be explored through its dependencies, its related specified qualifier node·s, its full path, its visualized position in the Target Model Specification views, and its value origins. In addition to relevant comments and potential errors, the SPS also lists typical and edge cases that exemplify how providers might document information with regard to other fields. References for each node are also included.

Glossary

Entry Node

Node where a value (string, integer, date or, in some cases, URI) is expected to be extracted from the corresponding field of a provider’s dataset. All data populating the generated semantic data network, i.e. all the explicit data values derived from the source data structure (checklist), will be stored in these nodes. In this document, the instance of the class and its label are, in most cases, presented as a single node. However, URIs directly become instances of their related class, whilst literals are attached through a property to a generated instance of this related class. More information about this distinction can be found in the semantic valuation field.
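The distinction between URI values and literal values can be illustrated with a minimal Python sketch. It is not part of the DOPHEDA pipeline: the function names, the use of rdfs:label as the attaching property, the example class, and the generated identifier are all assumptions made for illustration only.

```python
from urllib.parse import urlparse


def is_uri(value: str) -> bool:
    """Return True when the value looks like an absolute HTTP(S) URI."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def entry_node_triples(value, node_class, generated_id,
                       label_property="rdfs:label"):
    """Build (subject, predicate, object) triples for one entry-node value.

    Assumption: literals are attached with rdfs:label; the specification
    only says "through a property".
    """
    if is_uri(value):
        # A URI directly becomes an instance of the related class.
        return [(value, "rdf:type", node_class)]
    # A literal is attached to a generated instance of the related class.
    return [
        (generated_id, "rdf:type", node_class),
        (generated_id, label_property, value),
    ]
```

For example, calling `entry_node_triples("https://example.org/actor/1", "crm:E21_Person", "_:n1")` yields a single typing triple, whereas a plain string yields a generated instance plus a label triple.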

Generated Node

Automatically generated node that represents part of the full semantic relationship between entry nodes. Generated nodes are automatically populated with URIs (instances and classes) and labels that are generated by scripts and are inferred from, but not present in, the source datasets. From a data provider’s perspective, such nodes are useful to understand the purpose of the entities that will be automatically generated by the CIDOC CRM mapping process. Most generated nodes are defined in the CIDOC CRM specification and are thus not addressed here. One exception to this rule is the documentation of nodes used to categorize other generated nodes by designating them as being “X”. Such nodes are, hereafter, defined as specified qualifier nodes.

Specified Qualifier Node (SQN)

Generated node that contains specific vocabulary terms; its function is to categorize another instance node using a singular URI. This function is a common occurrence in the model, supporting the standardization of data and its retrievability by designating what the categorized instance content conceptually pertains to. If this categorized instance is also a type, the specified qualifier node is called a metatype. The use of specified qualifier nodes is recurring in the model as a way to better classify information, thus facilitating queries by federating entities under common types.

In addition to categorization purposes, specified qualifier nodes are sometimes descriptively used to identify which property is involved in an attribute assignment pattern or to represent the semantic information pertaining to CRMpc instances.

The singular URI used for categorization will be selected from a controlled vocabulary or thesaurus that is particular to the DOPHEDA project.

Node

A node is a meaningful point of interaction in the model’s semantic network. It contains information and connects non-hierarchically to other nodes in order to contextualise and link information. Nodes (in the context of this project) can either be entry nodes or generated nodes.

Reuse of elements from CIDOC CRM or one of its extensions is indicated by the mention of their corresponding code (e.g. E21, P1, F52, R64, etc.). Of these, only elements mobilized by CHIN’s Target Model Specification are presented here; more information on CIDOC CRM can be found at the following address: http://www.cidoc-crm.org/.

The absence of a code indicates that the node has been created by CHIN (this is mostly the case for entry nodes, which hold values submitted by providers and converted to semantic standards using the model).
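The code convention above lends itself to a simple check. The following Python sketch is only an illustration: the regular expression (an optional namespace prefix, one to three capital letters, digits, an optional "i" for inverse properties, and an optional "_name" suffix) is an assumption about how codes are written, and the function names are hypothetical.

```python
import re

# Assumed shape of a coded element name, e.g. "E21", "crm:E21_Person", "P1".
CODE_PATTERN = re.compile(r"(?:\w+:)?([A-Z]{1,3}\d+i?)(?:_\w+)?")


def crm_code(node_name: str):
    """Return the CIDOC CRM (or extension) code of a node name, or None."""
    match = CODE_PATTERN.fullmatch(node_name.strip())
    return match.group(1) if match else None


def created_by_chin(node_name: str) -> bool:
    """A node without a code is taken to have been created by CHIN."""
    return crm_code(node_name) is None
```

Under these assumptions, `crm_code("crm:E21_Person")` returns the code `E21`, while a label without a code, such as the hypothetical node name "Actor ID", is treated as a node created by CHIN.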

Documentation Structure

Scope

Description of the data values accommodated by the node; this includes both the formal data type expected and the semantic meaning of the node within a knowledge graph (formally indicated by the full semantic path).

Generated Bond·s

Connection of the node to the corresponding class; this connection is automatically brought about by the bonds’ respective association to a common binding entity.

Dependency·ies

List of nodes that must be filled in, when the provider uses the current node, for the mapping to run properly. More than one node might be required and generated nodes might be stated. All nodes must be used as they are presented in the Semantic Paths Specification, with the exception of the class E39_Actor, where one of its sub-classes, either E21_Person or E74_Group, should be used.

This field displays only the nodes that have to be completed if the current node is documented; it does not include all the nodes that could bring more meaning to the current one. To find those, the user should explore the associated Target Model Specification view·s.

Related SQN·s

This field displays the related specified qualifier node (SQN) that is essential to retrieve the present node. The path to the SQN is included in the full path of the present node and is enclosed in parentheses.

Full Path

This field displays the classes and properties that will be generated when a value is found in the described node. The following convention is used: Class -> Property -> Class (-> Property -> Specified Qualifier Node) -> Property -> Entry Node (-> Property -> Specified Qualifier Node). For the sake of readability, the rdfs:label properties are not shown. All classes must be used as they are presented in the Semantic Paths Specification, with the exception of the class crm:E39_Actor, which indicates that one of its sub-classes, either crm:E21_Person or crm:E74_Group, should be used when running a query. The starting point of the full path is always the focus class of the Target Model Specification’s facet (i.e. Actors or Objects).

Target Model Specification View·s

This field lists the Target Model Specification diagrams in which the node is used; these diagrams document patterns that can be used to model targeted situations customarily documented by heritage data providers.

Semantic Valuation

Indicators of the interlinked reusability of the values provided for the node according to the following tiers:

Low: non-standardized data that must be handled using the Messy Data pattern and will thus be semantically inexpressive (i.e. no new knowledge will be inferred from it through the model), but will remain accessible nonetheless.

Medium: standardized data that requires further normalization in order to enable semantic expressiveness. After such cleaning, new knowledge can be inferred from the data through the model, but the automation of the cleaning process induces a degree of uncertainty in the inferences.

High: standardized and cleaned data that will be semantically expressive once converted and will logically generate new information according to the model (thus creating an easily enhanceable dataset).

Each node will have different tier thresholds that will be documented per the following convention: “Tier Level: Threshold Elements”.

Data that does not comply with the lowest semantic valuation level will not be accommodated by CHIN’s model. The Accepted Value Type·s are indicated for the Medium and High tiers to show what kind of value must be used. The Low tier does not have an Accepted Value Type·s field since anything is acceptable at this threshold.

Typical Case·s

Contextualized example·s of values that would be considered to conform to the node’s scope note.

Edge Case·s

Contextualized example·s of values that would be considered to be at the limits of the parameters of the scope note.

Value Origin·s

Technical source·s of the value in the pipeline; this provenance will be indicated using the following convention: “Tool: Specific Field Label: Provider Data” (e.g. “Actors Checklist: Actor ID: Provider Data”).

Controlled List/Term

Link to the controlled list of terms (for entry nodes that rely on external vocabularies) or single term (for specified qualifier nodes) with which the recorded value must be reconciled (CHIN will be in charge of this reconciliation process). If the mentioned vocabulary is mandatory, it will be indicated using the following convention: “Vocabulary: Recommendation”.

Potential Error·s

List of potential errors that might arise from the functional use of the node when implemented in the pipeline.

Comment·s

Relevant notes, explanations, and annotations to the node.

Reference·s

Bibliographical citations relevant to the node and/or used to define its scope or other descriptive fields. Full references (using Chicago Style) are provided in the Bibliography.
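For readers preparing mappings or SPARQL queries, the Full Path convention described in this section (a main chain of classes and properties, with specified qualifier node branches enclosed in parentheses) can be read mechanically. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the parser assumes parentheses are never nested, and the actor expansion replaces every occurrence of crm:E39_Actor with the same sub-class.

```python
import re

ACTOR_SUBCLASSES = ("crm:E21_Person", "crm:E74_Group")


def parse_full_path(path: str):
    """Split a full path into its main chain and its SQN branches.

    Returns (main, branches): `main` lists the classes and properties found
    outside parentheses; `branches` maps the index in `main` of a qualified
    node to the [property, SQN] lists found in the group that follows it.
    """
    main, branches = [], {}
    # re.split with a capturing group alternates outside/inside chunks.
    for i, chunk in enumerate(re.split(r"\(([^()]*)\)", path)):
        tokens = [t.strip() for t in chunk.split("->") if t.strip()]
        if i % 2 == 0:
            main.extend(tokens)
        elif main:
            branches.setdefault(len(main) - 1, []).append(tokens)
    return main, branches


def expand_actor(main):
    """Return one variant of the chain per sub-class of crm:E39_Actor."""
    if "crm:E39_Actor" not in main:
        return [list(main)]
    return [
        [sub if token == "crm:E39_Actor" else token for token in main]
        for sub in ACTOR_SUBCLASSES
    ]
```

Applied to the convention string itself, `parse_full_path` returns the five-element main chain (Class, Property, Class, Property, Entry Node) and attaches one SQN branch to the second class and one to the entry node. `expand_actor` then gives the two chains to use when querying, one for persons and one for groups.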