Friday, January 17, 2014

TextAnalyzer: Adding a Lexicon Using Java's HashMap

This is part 2 of the TextAnalyzer series.  In this post, we will create a second analysis method.

TextAnalyzer

We will be adding a lexicon method to the TextAnalyzer class.  The lexicon is a useful and interesting tool: it analyzes the text and returns how many times each word is used in it.  For instance, if we call the lexicon method on a file with "hello hello world" in it, the lexicon will contain the following:
hello: 2
world: 1

To produce this information we will use Java's HashMap.  The HashMap allows you to associate a key (the word) with a value (the number of times the word is used).  We will traverse the contents ArrayList, and for each item in the list we will check whether it is already in the HashMap.  If the word is in the HashMap, we will update its associated value to be the value + 1 (because we have found it one more time).  If it is not in the HashMap, we will add it with a value of 1, because it is the first time we have seen it.  The code for the lexicon method is as follows:

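//requires java.util.HashMap, java.util.Iterator, java.io.PrintWriter,
//java.io.FileNotFoundException and java.io.UnsupportedEncodingException
public String lexicon(boolean writeOut) {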
 //output will be stored in this StringBuffer
 StringBuffer lexStringBuf = new StringBuffer("Lexicon: \n \n");
 HashMap<String, Integer> lexMap = new HashMap<String, Integer>();
 //following for loop iterates through
 //each item in the ArrayList
 for (String s : contents) {
  //convert string to lowercase so that
  //"The" and "the" count as the same word
  s = s.toLowerCase();
  if (lexMap.containsKey(s)) {
   lexMap.put(s, (lexMap.get(s) + 1) );
  } else {
   lexMap.put(s, 1);
  }
 }
 //now lexMap contains lexicon
 //use an iterator to traverse the map
 //and add to the StringBuffer
 Iterator<String> iterator = lexMap.keySet().iterator();  
  
 while (iterator.hasNext()) {  
  String key = iterator.next();  
  String value = lexMap.get(key).toString();      
  lexStringBuf.append(key + ": " + value + "\n");
 }
 //stringbuffer now has full output
 //create new file if writeout is true
 if (writeOut) {
  try {
   writer = new PrintWriter(fileName + "-lexicon.txt", "UTF-8");
   //only write and close if the file was opened successfully
   writer.print(lexStringBuf.toString());
   writer.close();
  } catch (FileNotFoundException | UnsupportedEncodingException e) {
   System.out.println("Error: " + fileName + " lexicon failed");
   e.printStackTrace();
  }
 }
 //return string
 return lexStringBuf.toString();
}

The first portion creates a StringBuffer (making it easy to create a large string by simply calling append()), and the HashMap which will map Strings to Integers.

The for loop goes through each String in the contents ArrayList (from the first TextAnalyzer post), converts it to lowercase (so that "The" and "the" are not counted separately), and then adds it to the HashMap.  If the word is already in the map, it updates the value to value + 1; otherwise it adds the word to the map with a value of 1.
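To try the counting step on its own, here is a minimal standalone sketch.  The class name WordCountSketch and the method countWords are made up for this illustration and are not part of TextAnalyzer; the word list is passed in directly instead of being read from a file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, used only to illustrate the counting loop.
public class WordCountSketch {
    // Counts how often each word occurs, ignoring case.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            String key = word.toLowerCase();
            if (counts.containsKey(key)) {
                // seen before: bump the stored count by one
                counts.put(key, counts.get(key) + 1);
            } else {
                // first sighting: start the count at 1
                counts.put(key, 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(Arrays.asList("Hello", "hello", "world"));
        System.out.println("hello: " + counts.get("hello")); // hello: 2
        System.out.println("world: " + counts.get("world")); // world: 1
    }
}
```

If you are on Java 8 or later, the if/else can be collapsed into a single call, counts.merge(key, 1, Integer::sum), which does the same thing.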

With the HashMap filled out, the Iterator portion is where we traverse the HashMap to get all of the key/value pairs and add them to our StringBuffer.  The Iterator is initialized to the Iterator of the key set of our HashMap.  This means that the Iterator will be traversing the key set.  The key set is the set of keys from our HashMap; in this case the set of words in the HashMap.  While there is a key, it grabs the key and the associated value, and then adds it to our StringBuffer.  
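As an aside, the key set is not the only way to walk a HashMap.  The sketch below (the class EntrySetSketch and its format method are names invented for this example) uses entrySet(), which hands back each key together with its value, so no separate get() call is needed.  It should produce the same lines as the Iterator version, though it is only one possible alternative.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class showing an alternative way to traverse the map.
public class EntrySetSketch {
    // Formats each key/value pair as "word: count" on its own line.
    static String format(Map<String, Integer> lexMap) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> entry : lexMap.entrySet()) {
            out.append(entry.getKey()).append(": ").append(entry.getValue()).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> lexMap = new HashMap<>();
        lexMap.put("hello", 2);
        lexMap.put("world", 1);
        // HashMap does not guarantee any particular order for these lines
        System.out.print(format(lexMap));
    }
}
```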

The format of writing to the file will be improved in a later post to list the most common words first, and to give the percentage of the total word count that each word makes up.  But for now we will stick with the simple example.
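If you want to experiment before then, here is a rough sketch of one way it could be done.  This is an assumption on my part and not necessarily what the later post will look like: the class SortedLexiconSketch, the ByCountDescending comparator and the format method are all names invented for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical class: sorts the lexicon by count and adds percentages.
public class SortedLexiconSketch {
    // Orders entries by count (highest first), breaking ties alphabetically.
    static class ByCountDescending implements Comparator<Map.Entry<String, Integer>> {
        @Override
        public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        }
    }

    static String format(Map<String, Integer> lexMap) {
        // total word count, needed for the percentages
        int total = 0;
        for (int count : lexMap.values()) {
            total += count;
        }
        // copy the entries into a list so they can be sorted
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(lexMap.entrySet());
        Collections.sort(entries, new ByCountDescending());
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> entry : entries) {
            double percent = 100.0 * entry.getValue() / total;
            out.append(String.format(Locale.ROOT, "%s: %d (%.1f%%)\n",
                    entry.getKey(), entry.getValue(), percent));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> lexMap = new HashMap<>();
        lexMap.put("hello", 2);
        lexMap.put("world", 1);
        System.out.print(format(lexMap));
        // prints:
        // hello: 2 (66.7%)
        // world: 1 (33.3%)
    }
}
```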

After creating the string, we again face the choice of simply returning it or writing it to a file and returning it, which is again handled by the if (writeOut) statement.  This will complete the lexicon() method for now!
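One thing to watch in the writeOut branch: if the file cannot be opened, there is no writer to print to, so the print and close calls should only happen after the PrintWriter was created successfully.  A try-with-resources block (Java 7 and up) is one way to get that for free, since it closes the writer automatically.  The sketch below assumes a local writer rather than the writer field used in TextAnalyzer, and the class WriteOutSketch and method writeText are names made up for this example.

```java
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;

// Hypothetical class illustrating a safe "write the string to a file" step.
public class WriteOutSketch {
    // Writes text to the given path; returns false if the file could not be opened.
    static boolean writeText(String path, String text) {
        try (PrintWriter writer = new PrintWriter(path, "UTF-8")) {
            writer.print(text);
            return true; // the writer is closed automatically on the way out
        } catch (FileNotFoundException | UnsupportedEncodingException e) {
            System.out.println("Error: could not write " + path);
            return false;
        }
    }

    public static void main(String[] args) {
        // the file name here is just an example
        writeText("example-lexicon.txt", "Lexicon: \n \nhello: 2\nworld: 1\n");
    }
}
```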


Tester

It's now time to update our Tester.  In the main method, add the following line after the creation of myAnalyzer:

myAnalyzer.lexicon(true);

Now go ahead and run your program and see what you get!
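For the Moby Dick example, the output should look something like the following (a HashMap does not guarantee any particular order, so your lines may appear in a different order):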

Lexicon: 
brave: 18
morgana: 1
approving: 1
champions: 1
unaccountable: 16
cripple: 3
writerof: 1
jew: 1
...... and so on!

There you have it!
Again, the lexicon will be improved in a later post, which will focus on using Comparators for custom classes.

1 comment:

  1. This blog is nice and very informative. I like this blog. Please keep it up.