When Working With Code...: TextAnalyzer: Reading and Writing Text Files in Java

This will be a quick tutorial on text file input/output in Java. For this project I will be creating a short program that will take text files and return a file of statistics such as number of times each word occurs, number of words, etc. The text files I will be using come from Project Gutenberg, an online project which allows you to download many books as text files.

This project will be constructed with three classes: Tester, FileTextReader, and TextAnalyzer.
- Tester will be the class in which we run all of the tests on the other classes.
- FileTextReader will be responsible for reading the file and returning an ArrayList of strings of the contents.
- TextAnalyzer will take the file name and call on FileTextReader to get the ArrayList of file contents. It will then perform the analysis on them, creating a new file summarizing the findings.

FileTextReader

First we will begin with the constructor. The FileTextReader will need a File and a Scanner. The Scanner will be used to parse the text in the given File. The File will be initialized in the constructor from a file name like so:

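// imports needed at the top of FileTextReader.java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

private File fileToRead = null;
private Scanner myScanner = null;

public FileTextReader(String fileName) {
    fileToRead = new File(fileName);
}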

Next we will write the readText() method of the FileTextReader class, which will split the contents of the file up by spaces and return them in an ArrayList of strings:

public ArrayList<String> readText() {
    ArrayList<String> textList = new ArrayList<String>();
    String toAdd = null;
    try {
        // open a scanner on the file
        myScanner = new Scanner(fileToRead);
        // while there is another token to take
        while (myScanner.hasNext()) {
            // strip every character that is not a letter, apostrophe, or space
            toAdd = myScanner.next().replaceAll("[^a-zA-Z\' ]", "");
            // skip tokens that are empty after stripping
            if (toAdd.length() > 0) {
                textList.add(toAdd);
            }
        }
    } catch (FileNotFoundException e) {
        System.out.println("Error: File not found");
        e.printStackTrace();
    } finally {
        // release the file once we are done with it
        if (myScanner != null) {
            myScanner.close();
        }
    }
    return textList;
}

readText() uses the Scanner to parse the input file and save its contents to an ArrayList. By default a Scanner splits its input on whitespace (spaces, tabs, and line breaks), so each call to next() returns the next run of non-whitespace characters. For instance, if the input file contained "foo bar", the first call to next() would produce "foo", and the second would produce "bar". The hasNext() method returns true if there is a token left in the input and false otherwise, so we use it as the condition of the while loop. This makes sure we grab each token in the file and place it into the ArrayList.
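To see the tokenizing on its own, here is a minimal sketch that scans a String instead of a file. The class name ScannerTokenDemo and the tokens() helper are made up for this illustration and are not part of the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerTokenDemo {
    // Collects every whitespace-delimited token from the given text.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        // Scanner(String) scans the string itself rather than a file.
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("foo bar"));         // [foo, bar]
        System.out.println(tokens("foo\n\tbar  baz")); // [foo, bar, baz]
    }
}
```

The second call shows that tabs, line breaks, and repeated spaces are all treated as a single separator.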

replaceAll() is a String method which replaces every match of the regex in the first parameter with the string in the second parameter. In this case, we are replacing any character that is not a letter, an apostrophe, or a space with an empty string, which deletes it. We then check that the length of the resulting string is greater than 0. This ensures that stray tokens such as "----" or "1.", which contain no letters at all, do not get added to the list. This will keep us from getting inflated word counts.
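A quick way to check what the regex keeps and what it throws away is to run it over a few sample tokens. This is only a sketch: the class name TokenCleanDemo, the clean() helper, and the sample tokens are my own choices for illustration.

```java
class TokenCleanDemo {
    // Strips every character that is not a letter, an apostrophe, or a space.
    static String clean(String token) {
        return token.replaceAll("[^a-zA-Z' ]", "");
    }

    public static void main(String[] args) {
        String[] samples = {"Hello,", "don't", "----", "1.", "well-known"};
        for (String sample : samples) {
            String cleaned = clean(sample);
            // An empty result means the token held no letters or apostrophes.
            System.out.println("\"" + sample + "\" -> \"" + cleaned + "\""
                    + (cleaned.isEmpty() ? " (skipped)" : ""));
        }
    }
}
```

Note one side effect of this approach: because the hyphen is deleted rather than treated as a separator, "well-known" comes out as the single word "wellknown".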

This will complete the FileTextReader class.
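One thing to keep in mind with this class: creating a File object does not open or check the file, so a wrong file name only shows up once the Scanner is created inside readText(), which then prints the error and returns an empty list. The sketch below shows that behavior in isolation. The class name MissingFileDemo, the canOpen() helper, and the file name are hypothetical, and the file is assumed not to exist.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

class MissingFileDemo {
    // Returns true if a Scanner could be opened on the file, false if it was not found.
    static boolean canOpen(File file) {
        try (Scanner scanner = new Scanner(file)) {
            return true;
        } catch (FileNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical name, assumed not to exist in the working directory.
        File missing = new File("no-such-file-for-demo.txt");
        // Constructing the File above did not fail; exists() tells us the truth.
        System.out.println("exists:  " + missing.exists());
        System.out.println("canOpen: " + canOpen(missing));
    }
}
```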

TextAnalyzer

Next we will write the TextAnalyzer class. This class will be responsible for analyzing the ArrayList generated in the FileTextReader class. To begin, we will create the following constructor:

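// imports needed at the top of TextAnalyzer.java
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;

private ArrayList<String> contents = null;
private String fileName = null;
private PrintWriter writer = null;

public TextAnalyzer(String fName) {
    contents = (new FileTextReader(fName)).readText();
    fileName = fName;
}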

The TextAnalyzer takes a file name, and then stores the ArrayList of the contents of that file by creating a FileTextReader object and calling readText() on it. This saves time later: because we store the list at the beginning, we will not have to read and parse the file again every time we want to perform a different analysis. Next, it saves the file name, which will be used to name the text files we will output.

With the constructor complete, it's time to write our first analysis method. For now, we will make a simple word count.

public String wordCount(boolean writeOut) {
    String count = "Word count: " + contents.size();
    if (writeOut) {
        try {
            writer = new PrintWriter(fileName + "-word-count.txt", "UTF-8");
            // only write once the writer was created successfully
            writer.println(count);
            writer.close();
        } catch (FileNotFoundException | UnsupportedEncodingException e) {
            System.out.println("Error: " + fileName + " word count failed");
            e.printStackTrace();
        }
    }
    return count;
}

This method begins by storing a string with the current word count. The method gets the word count by using the size() method of the ArrayList contents which was initialized in the constructor. The portion that creates a text file output of the word count is encased in an if statement because we may not always want to produce an output file. For instance, later we will write a method which will perform all of the analysis methods we have created on a given file. For this use, we do not want all of the files generated; rather we will want all of the information placed within a single output file.

To create the output file, we create a PrintWriter. The parameters used to create the PrintWriter are the name of the output file (be sure to include the extension) and the name of the character encoding, here UTF-8. Once the PrintWriter has been successfully initialized, we can write to the output file using methods such as print() and println(). The difference between these methods is that println() ends the line after writing its argument, whereas print() does not. When we are done writing, close() flushes any buffered text and releases the file.
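The difference between print() and println() is easiest to see by writing a small file and reading it back. The following is a sketch under a few assumptions of my own: the class name PrintWriterDemo and the writeAndReadBack() helper are made up, and it writes to a temporary file rather than to one of the project's output files.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class PrintWriterDemo {
    // Writes to a temporary file, then returns the lines read back from it.
    static List<String> writeAndReadBack() {
        try {
            Path file = Files.createTempFile("print-writer-demo", ".txt");
            // try-with-resources closes (and flushes) the writer automatically.
            try (PrintWriter out = new PrintWriter(file.toFile(), "UTF-8")) {
                out.print("Word count: "); // no line break after this
                out.println(2);            // finishes the first line
                out.println("done");       // a second, complete line
            }
            List<String> lines = Files.readAllLines(file); // reads as UTF-8
            Files.delete(file);
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeAndReadBack()); // [Word count: 2, done]
    }
}
```

The file ends up with two lines even though we made three calls, because the print() call did not end its line.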

Now that we have a basic method to analyze the text, it's time to create the Tester and make sure everything is good so far.

Tester

The Tester will be a simple class that will run our code. It does not need a constructor; we will only be writing a main() method for it. We will also need to write a short test text file. For the initial test, I just made a small file 'test.txt' that contained the text "hello world". Save this file in the directory the program runs from, since a relative name like 'test.txt' is resolved against the working directory (the same folder as your code if you run it from there; some IDEs use the project root instead). The Tester class will only need this in it for now:

public static void main(String[] args) {
    TextAnalyzer myAnalyzer = new TextAnalyzer("test.txt");
    myAnalyzer.wordCount(true);
}

Now when you run the Tester file, you should see a new file 'test.txt-word-count.txt' in your directory with the contents:
Word count: 2

Running this program on a download of Moby Dick from Project Gutenberg gives a word count of 214872.

Congratulations! Our initial text analyzer works. In following posts more analysis tools will be added, along with the option of analyzing entire folders at a time instead of single files.

Friday, January 17, 2014
