"Here, we present our corpus created based on claim revision histories collected from kialo.com.
3.1 A New Corpus based on Kialo
Kialo is a typical example of an online debate portal for collaborative argumentative discussions, where participants jointly develop complex pro/con debates on a variety of topics. The scope ranges from general topics (religion, fair trade, etc.) to very specific ones, such as particular policy-making questions (e.g., whether wealthy countries should provide citizens with a universal basic income). Each debate consists of a set of claims and is associated with a list of related pre-defined generic categories, such as politics, ethics, education, and entertainment.
What differentiates Kialo from other portals is that it allows editing claims and tracking changes made in a discussion. All users can help improve existing claims by suggesting edits, which are then accepted or rejected by the moderator team of the debate. As every suggested change is discussed by the community, this collaborative process should lead to a continuous improvement of claim quality and a diverse set of claims for each topic. As a result of the editing process, claims in a debate have a version history in the form of claim pairs, forming a chain where one claim is the successor of another and is considered to be of higher quality (examples are found in Table 1). In addition, claim pairs may have a revision type label assigned to them via a non-mandatory free-form text field, where moderators explain the reason for the revision.
Base Corpus
To compile the corpus, we scraped all 1,628 debates found on Kialo until June 26th, 2020, related to over 1,120 categories. They contain 124,312 unique claims along with their revision histories, which comprise 210,222 pairwise relations. The average number of revisions per claim is 1.7, and the maximum length of a revision chain is 36. Of all pairs, 74% have a revision type. Overall, there are 8,105 unique revision type labels in the corpus. Of the labeled claim pairs, 92% refer to three types only: Claim Clarification, Typo/Grammar Correction, and Corrected/Added Links. An overview of the distribution of revision labels is given in Table 2. We refer to the resulting corpus as ClaimRevBASE [...]
Extended Corpus
To increase the diversity of data available for training models, without actually collecting new data, we applied data augmentation.
ClaimRevBASE consists of consecutive claim version pairs, i.e., if a claim v has four versions, it will be represented by three pairs: (v1, v2), (v2, v3), and (v3, v4), where v1 is the original claim and v4 is the latest version. We extend this data by adding all pairs between non-consecutive versions that are inferable transitively. Considering the previous example, this means we add (v1, v3), (v1, v4), and (v2, v4). This is based on our hypothesis that every argument version is of higher quality than its predecessors, which we come back to below. Figure 1 illustrates the data augmentation. We call the augmented corpus ClaimRevEXT.
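The augmentation step can be sketched in a few lines of Python. This is only an illustration under the assumption that a revision chain is available as an ordered list of versions; the function names `consecutive_pairs` and `all_ordered_pairs` are hypothetical and not taken from the corpus tooling.

```python
from itertools import combinations


def consecutive_pairs(versions):
    """Pairs of adjacent versions, as stored in the base corpus."""
    return list(zip(versions, versions[1:]))


def all_ordered_pairs(versions):
    """Every (earlier, later) pair, including transitively inferred ones.

    Assumes `versions` is ordered from the original claim to the latest one.
    """
    return list(combinations(versions, 2))


# Toy chain with four versions, mirroring the example in the text.
chain = ["v1", "v2", "v3", "v4"]
base = consecutive_pairs(chain)
extended = all_ordered_pairs(chain)
added = [pair for pair in extended if pair not in base]

print(base)   # [('v1', 'v2'), ('v2', 'v3'), ('v3', 'v4')]
print(added)  # [('v1', 'v3'), ('v1', 'v4'), ('v2', 'v4')]
```

A chain with n versions yields n - 1 consecutive pairs and n(n - 1)/2 pairs after augmentation, so long chains contribute disproportionately many added pairs.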
For this corpus, we introduce the concept of revision distance, by which we mean the number of revisions between two versions. For example, the distance between v1 and v2 would be 1, whereas the distance between v1 and v3 would be 2. The distribution of the revision distances across ClaimRevEXT is summarized in Table 2. The numbers of claim pairs in the 20 most frequent categories of both corpus versions are presented in Figure 2. We will restrict our view to the topics in these categories in our experiments."
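To make the notion of revision distance concrete, the following minimal sketch counts how many pairs fall at each distance. It assumes versions in a chain are indexed 1, 2, 3, ... in order of creation; the helper names and the toy chain lengths are hypothetical and do not reflect actual corpus statistics.

```python
from collections import Counter


def revision_distance(i, j):
    """Number of revisions between version i and a later version j."""
    if j <= i:
        raise ValueError("j must refer to a later version than i")
    return j - i


def distance_histogram(chain_lengths):
    """Count version pairs per revision distance.

    `chain_lengths` holds the number of versions in each revision chain.
    A chain with n versions contains n - d pairs at distance d.
    """
    hist = Counter()
    for n in chain_lengths:
        for d in range(1, n):
            hist[d] += n - d
    return hist


# Toy input: three chains with 4, 2, and 3 versions (made-up values).
hist = distance_histogram([4, 2, 3])
print(revision_distance(1, 3))  # 2
print(sorted(hist.items()))     # [(1, 6), (2, 3), (3, 1)]
```

Pairs at distance 1 correspond to the consecutive pairs of the base corpus, while all pairs at larger distances come from the augmentation.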