To obtain usable data for sign language translation, we design a two-stage data labelling procedure:
- aligning video clips with their transcriptions.
- detecting the signer in each aligned video clip.
Both stages are necessary because the original subtitles are often misaligned with the sign videos, and the signer's position varies across clips.
In the first stage, we engage Auslan experts to synchronise video $\leftrightarrow$ fingerspelling, video $\leftrightarrow$ gloss, and video $\leftrightarrow$ sentence pairs.
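To make the outcome of this stage concrete, the sketch below shows one possible way to store an expert-aligned pair. The `AlignedClip` record and its field names are our illustration under assumed conventions, not the dataset's actual schema; only the three pair types follow from the text above.

```python
from dataclasses import dataclass

# Illustrative record for one expert-aligned pair. The field names are
# assumptions; the annotation types mirror the three pair types named
# in the text: fingerspelling, gloss, and sentence.
@dataclass
class AlignedClip:
    video_id: str      # identifier of the source video
    start_time: float  # clip start within the video, in seconds
    end_time: float    # clip end within the video, in seconds
    ann_type: str      # one of: "fingerspelling", "gloss", "sentence"
    annotation: str    # the expert-verified transcription text

# Example: a video <-> sentence pair synchronised by an Auslan expert.
clip = AlignedClip("video_0042", 12.4, 15.8, "sentence",
                   "The weather will be fine tomorrow.")
```

Keeping explicit start and end times per annotation makes each clip independently extractable, which is what the second stage operates on.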
In the second stage, we employ AlphaPose to track the people appearing in each aligned video clip, recording their trajectories and pose sequences along with the corresponding tracking IDs. Since AlphaPose occasionally produces tracking errors, we ask annotators to manually check and correct the tracking results by assigning the correct IDs to the signers.
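As a rough illustration of this stage, the sketch below groups AlphaPose detections into per-person tracks and merges tracks according to annotator corrections. It assumes AlphaPose's JSON output with pose tracking enabled, where each detection carries an `idx` track ID; the `apply_id_corrections` helper and the correction format are hypothetical, not the paper's actual annotation tool.

```python
import json
from collections import defaultdict

def load_pose_tracks(result_path):
    """Group AlphaPose detections into per-person pose tracks.

    Assumes AlphaPose's JSON output with pose tracking enabled: a list
    of detections, each with an 'idx' track ID, an 'image_id' frame
    name, and a flat 'keypoints' list of (x, y, score) triples.
    """
    with open(result_path) as f:
        detections = json.load(f)
    tracks = defaultdict(list)  # track ID -> ordered pose sequence
    for det in detections:
        tracks[det["idx"]].append({
            "frame": det["image_id"],
            "keypoints": det["keypoints"],
        })
    return tracks

def apply_id_corrections(tracks, corrections):
    """Merge tracks per annotator corrections (hypothetical helper).

    `corrections` maps a wrong track ID to the signer's correct ID,
    e.g. {7: 2} when the tracker split one signer into two identities.
    """
    fixed = defaultdict(list)
    for track_id, poses in tracks.items():
        fixed[corrections.get(track_id, track_id)].extend(poses)
    return fixed

# Usage: "alphapose-results.json" stands in for the tracker's output
# file; the correction {7: 2} is a made-up annotator decision.
tracks = load_pose_tracks("alphapose-results.json")
signer_tracks = apply_id_corrections(tracks, corrections={7: 2})
```

Separating automatic track extraction from the manual correction step mirrors the two-part workflow described above: the tracker provides candidate identities, and annotators only need to fix the IDs it gets wrong.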