论坛首页 Java企业应用论坛

字符串变异的演示

浏览 1888 次
精华帖 (0) :: 良好帖 (0) :: 新手帖 (0) :: 隐藏帖 (0)
作者 正文
   发表时间:2008-03-24  
引用
Programming: String Mutation

Richard Dawkins, in his book The Blind Watchmaker, explains his much-debated experiment known as the weasel program. This algorithm is designed to show that random mutations in DNA can create complex life forms through natural selection, rather than the will of an intelligent designer. His experiment took a sample string of gibberish and repeatedly mutated, or slightly changed, the string as if introducing an error in DNA duplication from one generation to another. The program had a "target" string in mind -- a line from Hamlet, "Methinks it is like a weasel" -- and continued until the random mutations added up such that the original string was the same as this target string. By running his program and finding that any line of gibberish can become Shakespeare through incremental changes, Dawkins suggested that natural selection can account for all the complexity of life.

Naturally, this is a controversial experiment, and an over-simplification of evolution. One criticism is that it is designed to run toward a target string; at every generation, a string's "fitness" to reproduce is determined by its similarity to the target. Could the target not be determined by an intelligent designer? And is an organism's fitness to survive proportional in such small increments to its genetic similarity with a better-adapted mutant? Whatever the answer, Dawkins probably did not anticipate that his experiment was an early example of what has become a common technique in machine learning: genetic algorithms. GAs can apply the same strategy effectively to complicated problems in artifical intelligence: come up with a model, make a lot of "daughter" models that are slightly different, choose the daughter that works a little bit better than the original model, and make her the new mother for the next generation. Lather, rinse, repeat.

For HW2 programming we're going to try our hand at genetic algorithms, using the same string-to-string transformation goal. Specifically, the steps of your program should be:
Take from the user, on the command line, two strings: an input string to start from, and a target string toward which to mutate. (Hint: when testing on CUNIX, you can use quotation marks to pass command line arguments with spaces. See example output.)
Model the input word as a linked list of letters (Strings of length 1), using Java's Linked List class.
Create a large set of mutated "daughter" strings from the input string, each of which is different in a small way:
Insertion. For each of 27 letters (including space), and each of N+1 positions in the input string, insert the letter into the word at the position. Use the add() method of LinkedList that takes an index to quickly insert the letter at the correct position. For a string of N letters, this method will produce 27 * (N+1) daughter strings. For example HELLO becomes AHELLO, BHELLO, etc., until HELLOZ (not counting the space character).
Deletion. For each letter in the input string, delete the letter. Use the remove() method of LinkedList() to do this quickly. For a string of N letters, this method will produce N daughter strings. For example, HELLO becomes ELLO, HLLO, HELO (twice) and HELL.
Choose the daughter string that is closest to the target string. To do this, use an algorithm to find each daughter string's fitness -- a score between 0 and 1 -- and find the daughter string with the highest score.
Fitness should be calculated as the length of the longest common substring of the target string found in the input string, divided by the length of the target string. For example, HPPLE has a score of .8 compared to APPLE, since the longest substring of APPLE found in HPPLE is 4 letters long, which is .8 of APPLE's length. Additionally, penalize the score by 0.1 for each extraneous letter the input string has over the length of the target string. For example. HAPPLE has a score of .9 compared to APPLE, because it is 1 letter longer, and otherwise has the complete substring APPLE. Note that this makes it possible for a candidate word to have a negative score.
Print out the generation number, the winning daughter string and her score.
Repeat the above from step 2, making the winning daughter the new input string, until the daughter has reached the target string. You can use either iteration or recursion to repeat appropriately. Then, terminate the program.
Here are some hints and guidelines for your program:
Let's be a little more formal about object-oriented design in HW2. Set up a class called Word which will represent a single word. The class should:
Translate an input string into a LinkedList in the constructor. It should take a String as an argument in the constructor, have a private LinkedList as a variable, and parse the string into the LinkedList (using the substring() method of String).
Have a method, mutate, which returns an array of all daughter Words to be sired by the target string. That is, each daughter should be a new Word.
Have a method makeCopy() that duplicates itself into an identical Word. Your mutate() function will use this to create many copies of itself and then change each copy.
Have a method toString() that creates a normal string from the LinkedList and returns it. (Hint: use a StringBuffer.)
Have one other class, MutationSim, that takes the input and target strings from the user, creates a Word from the target, mutates it, calculates fitness, and repeats as described above.
Learn about the String methods substring(), toUpperCase() and indexOf().
It'll help if you create a function that returns all the letter of the alphabet in an array over which you can iterate in a loop... something maybe like this:
public static String[] alphabet () {
   String[] alphabet = {"A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", " "};
   return alphabet;
}


Answer the following questions in your README:
How did you design your program? What design choices did you make in setting up your algorithm, other than the ones specified above?
How long does your program take? What do you think is the maximum number of generations needed to reach the target string, i.e., Big-Oh for an input string of I letters, a target string of T letters and an alphabet of L letters? Is this also the minimum generations needed? Why or why not?
How does the total run time relate to you answer to #2? How do you see your program's performance change as you increase the length of the strings?
Do you think the fitness function was a fair model of natural selection? Why or why not? How would you improve it?

Example Output
$ java MutationSim "Hello there" "How are you"
Generation 1: ELLO THERE (score: 0.18181818181818182)
Generation 2: ELLOW THERE (score: 0.2727272727272727)
Generation 3: LLOW THERE (score: 0.2727272727272727)
Generation 4: LLHOW THERE (score: 0.36363636363636365)
Generation 5: LHOW THERE (score: 0.36363636363636365)
Generation 6: LHOW ATHERE (score: 0.45454545454545453)
Generation 7: HOW ATHERE (score: 0.45454545454545453)
Generation 8: HOW ARTHERE (score: 0.5454545454545454)
Generation 9: HOW ARHERE (score: 0.5454545454545454)
Generation 10: HOW ARERE (score: 0.6363636363636364)
Generation 11: HOW ARE RE (score: 0.7272727272727273)
Generation 12: HOW ARE YRE (score: 0.8181818181818182)
Generation 13: HOW ARE YE (score: 0.8181818181818182)
Generation 14: HOW ARE YOE (score: 0.9090909090909091)
Generation 15: HOW ARE YO (score: 0.9090909090909091)
Generation 16: HOW ARE YOU (score: 1.0)


$ java MutationSim "it was a dark and stormy night" "to be or not to be"
Generation 1: IT WAS A DARK AND TORMY NIGHT (score: -0.9333333333333335)
Generation 2: T WAS A DARK AND TORMY NIGHT (score: -0.8333333333333334)
Generation 3:  WAS A DARK AND TORMY NIGHT (score: -0.7333333333333334)
Generation 4: WAS A DARK AND TORMY NIGHT (score: -0.6333333333333334)
Generation 5: AS A DARK AND TORMY NIGHT (score: -0.5333333333333334)
Generation 6: S A DARK AND TORMY NIGHT (score: -0.43333333333333346)
Generation 7:  A DARK AND TORMY NIGHT (score: -0.33333333333333337)
Generation 8: A DARK AND TORMY NIGHT (score: -0.23333333333333336)
Generation 9:  DARK AND TORMY NIGHT (score: -0.1333333333333334)
Generation 10: DARK AND TORMY NIGHT (score: -0.033333333333333354)
Generation 11: ARK AND TORMY NIGHT (score: 0.06666666666666665)
Generation 12: RK AND TORMY NIGHT (score: 0.16666666666666666)
Generation 13: K AND TORMY NIGHT (score: 0.16666666666666666)
Generation 14: K ANDT TORMY NIGHT (score: 0.2222222222222222)
Generation 15:  ANDT TORMY NIGHT (score: 0.2222222222222222)
Generation 16:  ANDOT TORMY NIGHT (score: 0.2777777777777778)
Generation 17:  ANOT TORMY NIGHT (score: 0.3333333333333333)
Generation 18:  NOT TORMY NIGHT (score: 0.3888888888888889)
Generation 19: R NOT TORMY NIGHT (score: 0.4444444444444444)
Generation 20: OR NOT TORMY NIGHT (score: 0.5)
Generation 21: OR NOT TOMY NIGHT (score: 0.5)
Generation 22:  OR NOT TOMY NIGHT (score: 0.5555555555555556)
Generation 23:  OR NOT TOY NIGHT (score: 0.5555555555555556)
Generation 24:  OR NOT TO NIGHT (score: 0.6111111111111112)
Generation 25: E OR NOT TO NIGHT (score: 0.6666666666666666)
Generation 26: BE OR NOT TO NIGHT (score: 0.7222222222222222)
Generation 27: BE OR NOT TO IGHT (score: 0.7222222222222222)
Generation 28:  BE OR NOT TO IGHT (score: 0.7777777777777778)
Generation 29:  BE OR NOT TO GHT (score: 0.7777777777777778)
Generation 30: O BE OR NOT TO GHT (score: 0.8333333333333334)
Generation 31: O BE OR NOT TO HT (score: 0.8333333333333334)
Generation 32: TO BE OR NOT TO HT (score: 0.8888888888888888)
Generation 33: TO BE OR NOT TO T (score: 0.8888888888888888)
Generation 34: TO BE OR NOT TO BT (score: 0.9444444444444444)
Generation 35: TO BE OR NOT TO B (score: 0.9444444444444444)
Generation 36: TO BE OR NOT TO BE (score: 1.0)


$ java MutationSim asafsdfkurweafasfp "methinks it is like a weasel"
Generation 1: ASAFSDFKUR WEAFASFP (score: 0.14285714285714285)
Generation 2: ASAFSDFKURA WEAFASFP (score: 0.17857142857142858)
Generation 3: ASAFSDFKUR A WEAFASFP (score: 0.21428571428571427)
Generation 4: ASAFSDFKURE A WEAFASFP (score: 0.25)
Generation 5: ASAFSDFKURKE A WEAFASFP (score: 0.2857142857142857)
Generation 6: ASAFSDFKURIKE A WEAFASFP (score: 0.32142857142857145)
Generation 7: ASAFSDFKURLIKE A WEAFASFP (score: 0.35714285714285715)
Generation 8: ASAFSDFKUR LIKE A WEAFASFP (score: 0.39285714285714285)
Generation 9: ASAFSDFKURS LIKE A WEAFASFP (score: 0.42857142857142855)
Generation 10: ASAFSDFKURIS LIKE A WEAFASFP (score: 0.4642857142857143)
Generation 11: SAFSDFKURIS LIKE A WEAFASFP (score: 0.4642857142857143)
Generation 12: SAFSDFKUR IS LIKE A WEAFASFP (score: 0.5)
Generation 13: AFSDFKUR IS LIKE A WEAFASFP (score: 0.5)
Generation 14: AFSDFKURT IS LIKE A WEAFASFP (score: 0.5357142857142857)
Generation 15: FSDFKURT IS LIKE A WEAFASFP (score: 0.5357142857142857)
Generation 16: FSDFKURIT IS LIKE A WEAFASFP (score: 0.5714285714285714)
Generation 17: SDFKURIT IS LIKE A WEAFASFP (score: 0.5714285714285714)
Generation 18: SDFKUR IT IS LIKE A WEAFASFP (score: 0.6071428571428571)
Generation 19: DFKUR IT IS LIKE A WEAFASFP (score: 0.6071428571428571)
Generation 20: DFKURS IT IS LIKE A WEAFASFP (score: 0.6428571428571429)
Generation 21: FKURS IT IS LIKE A WEAFASFP (score: 0.6428571428571429)
Generation 22: FKURKS IT IS LIKE A WEAFASFP (score: 0.6785714285714286)
Generation 23: KURKS IT IS LIKE A WEAFASFP (score: 0.6785714285714286)
Generation 24: KURNKS IT IS LIKE A WEAFASFP (score: 0.7142857142857143)
Generation 25: URNKS IT IS LIKE A WEAFASFP (score: 0.7142857142857143)
Generation 26: URINKS IT IS LIKE A WEAFASFP (score: 0.75)
Generation 27: RINKS IT IS LIKE A WEAFASFP (score: 0.75)
Generation 28: RHINKS IT IS LIKE A WEAFASFP (score: 0.7857142857142857)
Generation 29: HINKS IT IS LIKE A WEAFASFP (score: 0.7857142857142857)
Generation 30: THINKS IT IS LIKE A WEAFASFP (score: 0.8214285714285714)
Generation 31: THINKS IT IS LIKE A WEAASFP (score: 0.8214285714285714)
Generation 32: THINKS IT IS LIKE A WEASFP (score: 0.8571428571428571)
Generation 33: ETHINKS IT IS LIKE A WEASFP (score: 0.8928571428571429)
Generation 34: METHINKS IT IS LIKE A WEASFP (score: 0.9285714285714286)
Generation 35: METHINKS IT IS LIKE A WEASP (score: 0.9285714285714286)
Generation 36: METHINKS IT IS LIKE A WEASEP (score: 0.9642857142857143)
Generation 37: METHINKS IT IS LIKE A WEASE (score: 0.9642857142857143)
Generation 38: METHINKS IT IS LIKE A WEASEL (score: 1.0)


MutationSim.java:
/*
 * MutationSim.java, 02/22/2008
 */

public class MutationSim {

    private String input;
    private String target;

    private static String USAGE = "Usage: java MutationSim INPUT_STRING TARGET_STRING"
        + "\nOnly English alphabet and space is allowed in the strings.";

    public static void main(String[] args) {
        MutationSim sim = new MutationSim();
        try {
            sim.processCommandLine(args);
            sim.runSim();
        } catch (Exception e) {
            System.err.println(e.getMessage());
            e.printStackTrace(System.err);
            usage();
        }
    }

    private void processCommandLine(String[] args) throws Exception {
        if (args.length > 2)
            throw new Exception("Too many arguments.");
        if (args.length < 2)
            throw new Exception("Too little arguments.");
        if (isNonAlphabetExist(args[0]) || isNonAlphabetExist(args[1]))
            throw new Exception("Non-English alphabet character in string.");

        this.input = args[0];
        this.target = args[1];
    }

    void runSim() {
        Word word = new Word(this.input);
        Word targetWord = new Word(this.target);
        Word currentWord = null;
        double score = -2.0; // dummy value

        for (int generation = 1; score != 1.0; ++generation) {
            Word[] daughters = word.mutate();

            // iterate through the array,
            // do a linear search for best fit
            double maxFitness = -2.0;
            for (Word w : daughters) {
                double currentFitness = calculateFitness(w, targetWord);

                if (maxFitness <= currentFitness) {
                    currentWord = w;
                    maxFitness = currentFitness;
                }
            }

            printStat(generation, currentWord, maxFitness);
            word = currentWord;
            score = maxFitness;
        }
    }

    static void usage() {
        System.err.println();
        System.err.println(USAGE);
    }

    static void printStat(int generation, Word word, double score) {
        System.out.println("Generation " + generation + ": " + word.toString()
                + " {score: " + score + ")");
    }

    public static double calculateFitness(Word in, Word targ) {
        double ratio = Util.longestCommonSubstring(in.toString(),
                targ.toString()).length()
                / (double) targ.length();
        double penalty =
            (in.length() > targ.length()) ?
                (0.1 * (in.length() - targ.length()))
                : 0;
        return ratio - penalty;
    }
    
    private static boolean isNonAlphabetExist(String s) {
        s = s.toUpperCase();
        String[] alphabet = Util.alphabet();
        
        for (String letter : alphabet) {
            s = s.replaceAll(letter, "");
        }

        return (s.length() != 0);
    }
}


Word.java:
/*
 * Word.java, 02/22/2008
 */

import java.util.ArrayList;
import java.util.LinkedList;

/**
 * 
 */
public class Word {
    private LinkedList<String> list;

    public Word() {
        // list = null;
    }

    public Word(String input) {
        list = new LinkedList<String>();
        
        for (int i = 0; i < input.length(); ++i) {
            String currentLetter = input.substring(i, i+1);
            list.add(currentLetter.toUpperCase());
        }
    }

    public Word[] mutate() {
        String[] alphabet = Util.alphabet();
        int wordLength = this.list.size();
        int alphabetLength = alphabet.length;
        int arraySize = (wordLength + 1) * alphabetLength + wordLength;
        Word[] result = new Word[arraySize];
        int index = 0;

        for (int i = 0; i <= wordLength; ++i) {
            // insertion
            for (int j = 0; j < alphabetLength; ++j) {
                Word copy = this.makeCopy();
                copy.list.add(i, alphabet[j]);
                result[index++] = copy;
            }

            // deletion
            if (i < this.list.size()) {
                Word copy = this.makeCopy();
                copy.list.remove(i);
                result[index++] = copy;
            }
        }

        return result;
    }

    public Word makeCopy() {
        Word copy = new Word();
        copy.list = (LinkedList<String>) this.list.clone();
        return copy;
    }
    
    public int length() {
        return this.list.size();
    }

    @Override
    public String toString() {
        StringBuilder builder = new StringBuilder();
        for (String str : list) {
            builder.append(str);
        }

        return builder.toString();
    }
}


Util.java:
public class Util {
    
    /**
     * Returns the longest common substring of the given two arguments.
     * 
     * @param first the first string
     * @param second the second string
     * @return the longest common substring of the given two arguments
     */
    public static String longestCommonSubstring(String first, String second) {
        // could have used dynamic programming, or generalized suffix tree
        // to solve the LCS problem, but here we'll just stick to simplicity
        
        int start = 0; // The start in a of the longest found so far
        int len = 0; // The length of the longest found so far
        for (int i = 0; i < first.length() - len; ++i) {
            for (int j = first.length(); j > i + len; --j) {
                if (second.indexOf(first.substring(i, j)) != -1) {
                    start = i;
                    len = j - i;
                    break; // Exit the inner loop
                }
            }
        }

        return first.substring(start, start + len);
    }
    
    public static String[] alphabet() {
        String[] alphabet = { "A", "B", "C", "D", "E", "F", "G", "H", "I", "J",
                "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V",
                "W", "X", "Y", "Z", " " };
        return alphabet;
    }
}
论坛首页 Java企业应用版

跳转论坛:
Global site tag (gtag.js) - Google Analytics