Table of Contents

Download from the Weka website, click "Download" on the left, and select one of the "Stable book 3rd ed." versions for your machine.

You may need to increase the memory available to Weka. On Windows, check out Weka's page on the topic. On a Mac, use this command in the Terminal:

java -Xmx4096m -jar /Applications/weka-3-6-8.app/Contents/Resources/Java/weka.jar

k-means clustering


Figure 1: Click "Choose" to choose the clustering algorithm.


Figure 2: Choose "SimpleKMeans."


Figure 3: Click the name of the algorithm to get more options.


Figure 4: Possibly change the number of clusters or distance function.


Figure 5: For evaluation (including the confusion matrix), choose Classes to clusters evaluation and make sure "(Nom) class" (nominal variable called "class") is selected for comparison.

k-nearest neighbor classification


Figure 6: Click "Choose" to choose the classification algorithm.


Figure 7: Choose "IBk."


Figure 8: Click the name of the algorithm to get more options.


Figure 9: Possibly change the number of clusters.


Figure 10: For evaluation (including the confusion matrix), choose "Cross-validation" with 10 folds.

Preprocessing text

As discussed in the text classification notes, text files typically need to be converted into "feature vectors" before machine learning algorithms can be applied. The most common feature vector is a vector where each dimension is a different word, and the value in that dimension is either 0/1 binary value (yes or no the word is in that document), an integer count (the frequency of the word in that document), or a real value (often the tf-idf score of that word in that document).

The ARFF files we will be working with will always look like this:

@relation some-description-of-the-data
@attribute contents string
@attribute class {ham,spam}

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',ham
'Ok lar... Joking wif u oni...',ham
'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&Cs apply 08452810075over18s',spam
'U dun say so early hor... U c already then say...',ham
'Nah I dont think he goes to usf, he lives around here though',ham

Of course, the classes (ham/spam) may change. These files are very easy to generate, should you actually wish to use text classification for your own purposes.

We'll practice with the sms-spam.arff file, which has examples of SMS text msg ham and spam. This collection was produced by Tiago A. Almeida from the Federal University of Sao Carlos and José María Gómez Hidalgo from the R&D Department of Optenet. More information available here. You may find their paper interesting; it examines the performance of various machine learning algorithms on this data set.

Strings to binary feature vectors


Figure 11: Load the ARFF file and choose a filter.


Figure 12: Choose the StringToWordVector filter, in the Unsupervised > Attribute section.


Figure 13: Click the filter name to get more options.


Figure 14: Change "attributeIndices" to "1" to just select the first attribute (the contents of the SMS message). Change "attributeNamePrefix" to "word-" so that all the new features look like "word-foo" and "word-bar", etc. Change "wordsToKeep" to "100000" to keep them all.


Figure 15: Click "Apply"


Figure 16: Now you'll have lots more attributes. You can click "Undo" if you don't like it.


Figure 17: Everything works better if the class (ham/spam) is last. After our filter, it's not anymore. Let's fix that. Choose a new filter, "Reorder," under Unsupervised > Attribute. Click the filter name to change its properties. Type "2-last,1" to make the new order the second-to-last attributes followed by the first attribute. This puts the first attribute at the end.


Figure 18: Click "Apply." Now the class (ham/spam) should be the last attribute again.

Strings to frequency vectors

Follow the steps above ("Strings to binary feature vectors") except use the following changes in the StringToWordVector filter:


Figure 19: Change "outputWordCounts" to "True."

Strings to tf-idf vectors

Follow the steps above ("Strings to binary feature vectors") except use the following changes in the StringToWordVector filter:


Figure 20: Change "IDFTransform" to "True" and "TFTransform" to "True," as well as "normalizeDocLength" to "Normalize all data."

Intro to AI material by Joshua Eckroth is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Source code for this website available at GitHub.