User:Shirik/CollabRC/Genetics

It is becoming increasingly apparent that a true, but unconventional, genetic algorithm will need to be put in place to handle the artificial intelligence bot. A set of chromosomes which detect potential vandalism will be developed. Every evolution of the genetic model will allow for crossbreeding, mutation, and survival of the fittest. This means a few things can happen: Randomness will (hopefully) pick up newer patterns of vandal-like behavior, crossbreeding will take those newly found traits that are successful and pull them into a more optimal solution, and survival of the fittest will ensure only those traits which are useful to detection of vandalism will remain. To succesfully develop this system, potential chromosomes must be identified. Each chromosome must have a set of genes, each with a set of alleles (which, unlike conventional genetics, may have a spectrum of values instead of simply a binary representation) which can be adjusted.

Word choice - A set of words with which to identify vandal-like behavior. The number of words in this list and the contents of that list are traits which are set using the evolutionary model. The following input variables are used in a generic algorithm:
- Weighted sum - Each word is given a weight from 0 to 1. The word's frequency in a diff's "added" set is multiplied by the word's weight. All values are summed together and bounded on the interval [0, 1] which produces the final result.
- Word presence - Each word in the word list that is present at least once in the diff's "added" set counts for 1 point. The total number of points is divided by the number of words in the word list.
Change significance - Determines the likelihood of vandalism based off the number of lines that have been added or removed. Both the number of lines added and the number of lines removed are used in a generic algorithm.
Repetitiveness - Determines the likelihood of vandalism by analyzing repetitive nature of added content, a generic algorithm with the following input variables:
- Character standard deviation - Of the characters that are present, analyze the frequency of the characters and compute a standard deviation
- Word standard deviation - Of the words that are present, analyze the frequency of each word and compute a standard deviation
Anonymity - While anonymous users can be and are constructive, and good faith should be assumed, it is also true that the majority of vandalism comes from anonymous users, so it is appropriate to include this as a factor into determination of vandalism. It is assumed that the fitness function will ensure that this attribute is not weighted too heavily, which would otherwise result in all anonymous users being detected. Instead, this attribute should only result in an increased strictness for other traits if the user is anonymous. This trait only results in the value 0 or 1
Bias algorithm - The formula through which decisions by each chromosome are combined together into a final result. This is a required trait for every organism. It is represented by a computational algorithm not to exceed 20 levels deep.
Final decision - The final decision is based on the output of the bias algorithm. A simple comparison is made to two constants, a and b, bounded that a < b. Any value exceeding b results in a "high confidence vandalism" result. Any value exceeding a but not exceeding b results in a "vandalism" result. Any value not exceeding a results in a "not vandalism" result. These three results are then used in actions by the bot and in the fitness function.

Computational algorithm

For any algorithm that requires abstracted mathematics, a genetic programming method will be introduced that allows the tree to be built based off successive evolutions of the model. The model will use a tree-style abstracted algorithm that has the following operations as options: addition (a + b), subtraction (a - b), division (a / b), multiplication (a * b), power (a ^ b), arithmetic mean (a/2 + b/2), geometric mean (sqrt(ab)), LHS (a * 1), RHS (1 * b), zero (0 * 0), and one (0 * 0 + 1). The tree length will incorporate all variables as independent nodes initially combined with a constant and traverse up the tree accordingly. Consider the following representation, where "+" represents any operation, letters represent input variables, and numbers represent constants.

                    +
                   / \
                  /   \
                 /     \
                /       \
               /         \
              /           \
             +             +
            / \           / \
           /   \         /   \
          /     4       5     \
         /                     \
        +                       +
       / \                     / \
      /   \                   /   \
     /     \                 /     \
    /       \               /       \
   +         +             +         +
  / \       / \           / \       / \
 /   \     /   \         /   \     /   \
A     0   B     1       C     2   D     3

From here, a random branch can be taken during evolution steps to build upon. A taken branch can be as simple as a constant or as complex as the entire tree. One branch will be taken from each parent and then will be combined together using a random operation. The genetic operations are as follows:

Replicate - Take the exact tree from one parent
Crossbreed - Take a branch from each parent and combine using a random operation with a constant followed by a random operation.
Mutate - Take the exact tree from one parent, then randomly change one or more operation(s), constant(s), and/or variable(s)

Fitness function

The fitness function is an implementation of the concept of survival of the fittest. It is a static function used to determine which results are suitable. A perfect set of genes would result in a value of 0. Any false positives or false negatives add to the score. Only the sets that have the lowest values (the fittest) will survive and have their traits passed on to the next generation. The following values are summed:

False negatives 1 point
False positives 10 points
False confident positives 1000 points
Correct confident positives -1 point
Each computational tree depth 0.01 points

These values may need to be tuned during training as necessary.