Linking tokens and metadata with the same namespace #776

amyisard · 2022-05-18T10:06:14Z

In the corpus I am working with we have 2 participants in a signed dialogue, and they each have multiple annotation tiers. We use namespaces in the annotations e.g.

PersonA::Gloss
PersonA::Mundbild
PersonB::Gloss
PersonB::Mundbild

and also in some of the metadata e.g.

PersonA::AltersGruppe

I created the relANNIS corpus using Pepper and added links between tokens from different tiers belonging to the same participant, so the query

Gloss=/ZUG.*/ ->ident Mundbild=/.*zug.*/

will find Gloss - Mundbild pairs from both PersonA and PersonB but only where the Gloss and the Mundbild belong to the same participant.

Is there any way to do this for the metadata?

At the moment if I want to find tokens in the Gloss tier produced by participants in a certain age group I have to use the query

(PersonA:Gloss=/KANN.*/ @* PersonA:AltersGruppe="31-45") | (PersonB:Gloss=/KANN.*/ @* PersonB:AltersGruppe="31-45")

because

Gloss=/KANN.*/ @* AltersGruppe="31-45"

returns results where either one of the participants is in the age group, not just the one who signed the Gloss.
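
As long as the metadata cannot be tied to a participant, the disjunction has to be spelled out once per namespace. A minimal sketch of generating it with plain string formatting, assuming the namespaces are known up front; `per_speaker_query` is a hypothetical helper and not part of ANNIS:

```python
def per_speaker_query(namespaces, anno, anno_regex, meta, meta_value):
    """Build one AQL alternative per speaker namespace and join them with '|'."""
    alternatives = [
        '({ns}:{anno}=/{regex}/ @* {ns}:{meta}="{value}")'.format(
            ns=ns, anno=anno, regex=anno_regex, meta=meta, value=meta_value
        )
        for ns in namespaces
    ]
    return " | ".join(alternatives)


query = per_speaker_query(
    ["PersonA", "PersonB"], "Gloss", "KANN.*", "AltersGruppe", "31-45"
)
print(query)
```

This only removes the typing effort. The query itself still grows with every additional participant namespace.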

If I have missed that there is already a way to restrict queries to tokens or metadata which are all in the same namespace, please let me know!

thomaskrause · 2022-06-01T13:31:50Z

Maintainer

There are several answers or possible solutions to this problem, which stems from the different data models involved in creating the corpus and their respective limitations and capabilities. As you did with the ->ident pointing relations, you need some way of connecting the annotation nodes to the speaker metadata. I don't think this is possible with the pure relANNIS data model (which is used when converting to ANNIS with Pepper), but there might be a way of leveraging the flexibility of the graphANNIS data model with some light scripting.

In the graphANNIS data model, this could be achieved by adding the meta annotations to a corpus node for each speaker and then adding PartOf relations from the specific annotation nodes to that node. There could e.g. be three subcorpus nodes, and annotation nodes are either connected to the speaker corpus nodes or, if they are more generic like the timeline tokens, attached directly to the document subcorpus node.

          document
       /            \
   person1        person2
 /    |    \       /    \
 |    |     \     /      \
w1 -> w2 -> w3   w10 -> w11

You could then search for something like Gloss=/KANN.*/ @* AltersGruppe="31-45", and only the words that are part of a speaker sub-document that actually has this metadata would be found.
Because a token/annotation node can be part of several sub-corpora and annotations can exist between any nodes regardless of their PartOf relations, this would still allow grouping them into documents and displaying them together in the result view.

The obvious problem with this approach is that no converters exist for directly converting data to the graphANNIS data model (e.g. in the GraphML format that can be read by the import CLI). In theory, it would be possible to use the Python API to construct such corpora on your own, but this would be a lot of work.

By using Pepper to convert the data to relANNIS, two data models are involved: Salt and the relANNIS model.
The latter one is more restrictive, so I would focus on how to achieve something like you want in this data model (in theory one could write a GraphML exporter for Pepper, so one would only be limited by the Salt data model, but such an exporter does not exist yet).

RelANNIS has the concept of documents and texts. A document can consist of several texts, and it is possible to have pointing relations between annotation nodes of any text inside the same document. That could help in your case, but I assume you have some kind of timeline, which means the relANNIS exporter will create a single text with empty tokens for the timeline items. Also, in relANNIS only documents can have metadata, not texts. This means you can't use the same structure as outlined for graphANNIS above.

I thought that it might be possible to solve this by adding node annotation labels with the speaker id and then doing an "identical value" (==) operator search, but you still only have one document node for several speakers, and thus you can't link the annotation to a distinct sub-corpus node belonging only to this speaker.

There might be a solution that makes use of the graphANNIS data model, but does not need the implementation of a full conversion script in Python. Instead, you might convert and import the corpus with Pepper and the graphANNIS CLI. Then, you could use the Python API to add some additional annotations and nodes as a post-processing step.

For every speaker you know (probably defined by the namespace of the document node annotations), you create a new graphANNIS node with the node type "speaker". It is important that this node type is neither "node" (an annotation node) nor "corpus" (a document). If it were "corpus", the ANNIS frontend would treat it as a document, and it won't be able to handle several documents for one annotation node. Add all relevant metadata like age group to the speaker node. Then find all annotation nodes/Gloss/token that belong to that speaker and add a PartOf relation between the annotation node and the speaker node. It is totally OK for an annotation node to have multiple incoming PartOf edges.

   document ---+
  /   |   \  \  \
 /    |    \  \  \   
 |    |     \  \  \      
w1 -> w2 -> w3 /   \
  \   |   /  w4 -> w5 
    spk1-+    \    /
               spk2

This would allow you to search for

w1 @* spk1

and

w1 @* document
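
Why this restricts the match to the right speaker can be illustrated without graphANNIS at all. The following is a toy model in plain Python, not the graphANNIS API: `part_of`, `meta` and `has_meta_ancestor` are made-up names, the node names follow the diagram above, and the metadata values are invented. A node only matches if some node reachable over PartOf edges carries the requested metadata.

```python
# Toy model: PartOf edges from each node to the nodes it is part of.
part_of = {
    "w1": ["document", "spk1"],
    "w2": ["document", "spk1"],
    "w3": ["document", "spk1"],
    "w4": ["document", "spk2"],
    "w5": ["document", "spk2"],
}
# Metadata labels per container node (names and values are illustrative).
meta = {
    "document": {"Datum": "2011-09-02"},
    "spk1": {"AltersGruppe": "18-30"},
    "spk2": {"AltersGruppe": "31-45"},
}


def has_meta_ancestor(node, name, value):
    """Return True if any node reachable via PartOf has the label name=value."""
    stack = list(part_of.get(node, []))
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if meta.get(current, {}).get(name) == value:
            return True
        stack.extend(part_of.get(current, []))
    return False


matches = [w for w in sorted(part_of) if has_meta_ancestor(w, "AltersGruppe", "31-45")]
print(matches)
```

Only the words attached to spk2 are selected for the age group, while a document-level field such as the date is still reachable from every word.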

If you want to distribute the post-processed corpus, you can export the modified corpus as GraphML using the graphANNIS command line. Or you just use a script that imports the corpus to a temporary data folder, post-processes it and then exports the result as GraphML.

I know that it might be frustrating to not have a simple way of solving this with relANNIS and the Pepper ecosystem. We have some ideas for a graphANNIS-based Pepper to avoid the limitations of the Salt and relANNIS data models, but this is more experimental and not an actual project with estimated deadlines.

1 reply

amyisard Jun 3, 2022
Author

Thanks, that's really helpful! I don't have time at the moment but for the next release of the corpus data I will try out the post-processing option, that sounds like a good solution.

amyisard · 2022-09-22T18:47:26Z

Author

I've been looking at doing this using the python interface as you suggested, and I can read in the corpus, perform queries, add nodes and edges and export the corpus without any problems.

But in order to move some metadata from one node to another I need to be able to access the values of the metadata and I can't figure out how to do that. Here is the original metadata for one transcript

  <node id="single-transcript/1420216">
        <data key="k37">corpus</data>
        <data key="k34">1420216</data>
        <data key="k1">18-30</data>
        <data key="k2">18-30</data>
        <data key="k3">31-45</data>
        <data key="k4">2011-09-02</data>
        <data key="k8">männlich</data>
        <data key="k9">weiblich</data>
        <data key="k10">männlich</data>
        <data key="k24">BER-43</data>
        <data key="k25">BER-40</data>
        <data key="k26">ber-mod-1</data>
        <data key="k29">Berlin (Berlin, Brandenburg, tw. Sachsen-Anhalt)</data>
        <data key="k30">BER</data>
        <data key="k31">Erlebnisbericht</data>
        <data key="k32">dgskorpus_ber_14</data>
        <data key="k33">1420216</data>
    </node>

I can create a new node for PersonA, PersonB, and PersonMod, and I can find and link all the other appropriate token/Gloss etc.

But then I need to move the appropriate metadata so I would end up with

    <node id="PersonANode">
        <data key="k37">person</data>
        <data key="k1">18-30</data>
        <data key="k8">männlich</data>
        <data key="k24">BER-43</data>
    </node>

    <node id="PersonBNode">
        <data key="k37">person</data>
        <data key="k2">18-30</data>
        <data key="k9">weiblich</data>
        <data key="k25">BER-40</data>
    </node>

    <node id="PersonMod">
        <data key="k37">person</data>
        <data key="k3">31-45</data>
        <data key="k10">männlich</data>
        <data key="k26">ber-mod-1</data>
    </node>


    <node id="single-transcript/1420216">
        <data key="k37">corpus</data>
        <data key="k34">1420216</data>
        <data key="k4">2011-09-02</data>
        <data key="k29">Berlin (Berlin, Brandenburg, tw. Sachsen-Anhalt)</data>
        <data key="k30">BER</data>
        <data key="k31">Erlebnisbericht</data>
        <data key="k32">dgskorpus_ber_14</data>
        <data key="k33">1420216</data>
    </node>

but I can't figure out how to access and move those individual bits of data.

I've tried doing it by hand for one transcript and it seems to give the correct query results - I'm sure I can figure out a way to copy the bits of metadata from elsewhere, but for the whole corpus it would be great to manage it all within the python interface.
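
For reference, the by-hand step can also be scripted directly on an exported GraphML file with the Python standard library. This is only a sketch under several assumptions: the sample below omits the GraphML XML namespace and the `<key>` definitions that a real export contains, the mapping `KEY_OWNER` from key ids to person nodes is hypothetical and would have to be derived from the key definitions, and the PartOf edges still have to be added separately.

```python
import xml.etree.ElementTree as ET

# Simplified sample; a real GraphML export declares an XML namespace and keys.
SAMPLE = """<graph>
  <node id="single-transcript/1420216">
    <data key="k37">corpus</data>
    <data key="k1">18-30</data>
    <data key="k2">18-30</data>
    <data key="k4">2011-09-02</data>
  </node>
</graph>"""

# Assumed mapping from key ids to the person node that should own them.
KEY_OWNER = {"k1": "PersonANode", "k2": "PersonBNode"}
TYPE_KEY = "k37"  # key that holds the node type in the sample


def split_metadata(graph, key_owner, node_type="person"):
    """Move the <data> entries listed in key_owner from corpus nodes to person nodes."""
    for doc in list(graph.findall("node")):
        doc_type = next(
            (d.text for d in doc.findall("data") if d.get("key") == TYPE_KEY), None
        )
        if doc_type != "corpus":
            continue
        persons = {}
        for data in list(doc.findall("data")):
            owner = key_owner.get(data.get("key"))
            if owner is None:
                continue  # document-level metadata stays where it is
            if owner not in persons:
                # Person node ids are made unique per transcript (an assumption).
                person_id = doc.get("id") + "/" + owner
                person = ET.SubElement(graph, "node", {"id": person_id})
                ET.SubElement(person, "data", {"key": TYPE_KEY}).text = node_type
                persons[owner] = person
            doc.remove(data)
            persons[owner].append(data)
    return graph


graph = split_metadata(ET.fromstring(SAMPLE), KEY_OWNER)
print(ET.tostring(graph, encoding="unicode"))
```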

2 replies

thomaskrause Sep 26, 2022
Maintainer

To move entries, I would probably read them from the original node, create the new node, add a copy of the annotations to the new node and delete the annotations from the original node. Reading values is often easiest using a frequency query but getting a subgraph for the single match and extracting the information from the (networkx) graph would also work.

Something like the following sketch of Python code (I did not actually test it, but the overall principle should work):

from graphannis.cs import CorpusStorageManager
from graphannis.graph import GraphUpdate
from graphannis.util import node_name_from_match

with CorpusStorageManager() as cs:
    g = GraphUpdate()
    # Find all transcript document nodes
    transcripts = cs.find("mycorpus", "annis:doc")
    for transcript_match in transcripts:
        transcript_id = node_name_from_match(transcript_match[0])
        # Get the values for the meta data fields of the transcript by using
        # a frequency query. A subgraph-query would also work, but is probably
        # more complicated to get the actual values.
        freq_table = cs.frequency("mycorpus",  
            'annis:doc="{}"'.format(transcript_id), 
            "1:speaker-id,1:age-group,1:gender")
        # There should only be one row in the frequency table, because the 
        # transcription ID is unique
        speaker_id = freq_table[0].values[0]
        age_group = freq_table[0].values[1]
        gender = freq_table[0].values[2]
        # Prepare update to add these values to the new node
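        # Use a node name that is unique per transcript and a node type that
        # is neither "node" nor "corpus" (as described earlier in this thread)
        person_node_id = "{}/PersonMod".format(transcript_id)
        g.add_node(person_node_id, "speaker") # TODO: add PartOf edges etc.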
        g.add_node_label(person_node_id, "", "speaker-id", speaker_id)
        g.add_node_label(person_node_id, "", "age-group", age_group)
        g.add_node_label(person_node_id, "", "gender", gender)
        # Delete the labels from the old node
        g.delete_node_label(transcript_id, "", "speaker-id")
        g.delete_node_label(transcript_id, "", "age-group")
        g.delete_node_label(transcript_id, "", "gender")
    # Apply the collected graph updates
    cs.apply_update("mycorpus", g)
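
The sketch above handles a single speaker. To cover every speaker, the namespaced metadata names can be grouped first. A minimal sketch in plain Python, assuming the qualified names use the `Namespace::Name` form shown in the first post; `group_by_speaker` is a hypothetical helper, and the sample names and values are illustrative:

```python
from collections import defaultdict


def group_by_speaker(labels, separator="::"):
    """Split qualified metadata names into one dict per speaker namespace.

    Names without a namespace stay with the document (stored under key None).
    """
    grouped = defaultdict(dict)
    for qualified_name, value in labels.items():
        namespace, sep, name = qualified_name.partition(separator)
        if sep:
            grouped[namespace][name] = value
        else:
            grouped[None][qualified_name] = value
    return dict(grouped)


labels = {
    "PersonA::AltersGruppe": "18-30",
    "PersonB::AltersGruppe": "18-30",
    "PersonMod::AltersGruppe": "31-45",
    "Datum": "2011-09-02",  # illustrative document-level field
}
per_speaker = group_by_speaker(labels)
print(per_speaker)
```

Each group other than `None` would then become one `add_node` call followed by an `add_node_label` call per entry in the `GraphUpdate`.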

amyisard Sep 27, 2022
Author

Thanks for your quick reply, I've got all of that to work now. However, the data attached to the new nodes does not seem to be recognized as metadata. As you said above, the nodes for each speaker have to have a type which is not "corpus" or "node", so I've given them the type "participant". Is it the case that metadata can only be attached to a "corpus" node? Or is there something else I need to add to identify the data as metadata?

thomaskrause · 2022-10-05T07:24:51Z

Maintainer

To show up as metadata in the result view of the ANNIS 4 interface, the node type must currently be corpus. But the interface expects a hierarchical structure in several places, e.g. for subcorpora:

mydocument <-PartOf <- ParticipantDesc <- OtherCollectionSubCorpus <- RootCorpus

As long as there is a single path from each document to its root corpus, I think using the type corpus for the participant nodes is fine (but of course it's better to test this, there might be bugs we need to resolve).
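
The single-path condition can be checked on a toy model before trying it in ANNIS. This sketch is plain Python and not the graphANNIS API: `paths_to_root` is a made-up helper, the `part_of` dictionaries are hand-written stand-ins for the PartOf edges, and the graph is assumed to be acyclic.

```python
def paths_to_root(node, part_of):
    """Enumerate all PartOf paths from node up to a node without parents."""
    parents = part_of.get(node, [])
    if not parents:
        return [[node]]
    return [
        [node] + rest
        for parent in parents
        for rest in paths_to_root(parent, part_of)
    ]


# Strictly hierarchical: exactly one path from the document to the root.
hierarchical = {
    "mydocument": ["ParticipantDesc"],
    "ParticipantDesc": ["OtherCollectionSubCorpus"],
    "OtherCollectionSubCorpus": ["RootCorpus"],
}
# Two participant nodes above the same document: two paths to the root.
ambiguous = {
    "mydocument": ["spk1", "spk2"],
    "spk1": ["RootCorpus"],
    "spk2": ["RootCorpus"],
}

single = paths_to_root("mydocument", hierarchical)
multiple = paths_to_root("mydocument", ambiguous)
print(len(single), len(multiple))
```

A document with more than one path is the case where the interface might not cope with participant nodes of type corpus.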

Also, even when the display does not work for nodes of type participant, the search with @* should still work. Can you test if this is the case?

0 replies
