The University of Queensland Homepage
School of ITEE ITEE Main Website

 Assignment 2: Machine Learning

COMP3702/COMP7702

Assignment 2: Handwritten letter recognition

Due 5pm, Friday, 23 October 2009.

This assignment counts 20% toward the final marks.

Purpose

This assignment is intended to practically acquaint you with machine learning methodology and techniques. Additionally, it should provide some understanding of how simple pattern recognition tasks, such as handwritten letter/digit recognition (popular in handheld and ubiquitous computing, and optical character recognition applications), can be implemented.

Code and Data

The supporting material can be used freely in your work: zipped source code, package descriptionzipped data set.

The supporting code is in Java and we recommend that you to use and modify this to complete the assignment. If you do not know Java currently, you will not need to learn a great deal of Java to do the assignment, but working in a pair might be a good idea.

The data set of over 5000 characters was generated by students in a previous year using GenerateLetters.java. Seeing the working code may give you ideas (e.g. for preprocessing), but you do not need to generate a new data set.

(See also known issues with the code and data set)

Administration

The assignment is worth 20% of your final grade and you may work individually or in pairs. Working individually will attract an automatic 1 bonus mark. As with the first assignment, there are more than the maximum marks available. You may attempt as many marks as you wish (to create a “buffer” of marks), but your grade will be capped at 20 marks for this assignment, even if your raw score is over 20. You are required to submit two separate components for this assignment, these are:

·         A package file containing all the source code files (e.g. *.java), including any you modify. Modified sections of source code should include comments and overall, the code should compile. There is no excuse for handling in a program which does not compile/interpret or run. If you are having trouble, some features may be omitted. Package and submit all the source code (*.java), including unmodified files, maintaining directory structure as necessary. You may package the source code into e.g. a .zip or .jar file.

·         A pdf with the written component of the assignment (see template/example for general guidance). Please indicate in your pdf which source code files you modified and if there are any new files. Also indicate in your pdf exactly what parts you have attempted, in part or in full, so that we know what to look for when marking.

Submission of both parts is via the ITEE online submission website found at http://submit.itee.uq.edu.au. Your group name should be the student number of the person submitting the assignment. You should also use your student number(s) within the pdf and code, and as part of the pdf and packaged code filenames. Note:  There will be marks deducted if your PDF and code do not contain your names and Student IDs.

If you resubmit your assignment, please resubmit all files relating to your solution. Your most recent submission will be the only submission marked.

In line with University policy collusion and plagiarism will not be tolerated; see the course profile for more details on these topics.

Each group must submit an assignment cover sheet signed by all members of the group. The online submission system will offer you a printable cover sheet as you submit your assignment. Otherwise, a suitable ITEE assignment cover sheet can be obtained from http://studenthelp.itee.uq.edu.au/assignments/ . You should drop your completed form into the COMP3702/7702 submission box on level 1 of GP South as soon as possible after submitting your assignment.

You can discuss aspects of the assignment with other members of the class on the comp3702 newsgroup (see my.uq.edu.au). If you want to ask the lecturer any questions, sending an email to ruth@itee.uq.edu.au is the best option.

The Assignment

You have been given a large dataset which contains samples of “handwritten” letters. You are to investigate how machine learning techniques create models to correctly classify instances of handwritten letters.

Source code has been provided which implement a simple version of each of these machine learning techniques (see Code and Data above).

Part

Description

Marks

1

Basic Study

You may complete one or both of the tasks 1A and 1B

Investigate and optimise the performance of the already implemented machine learning technique. Divide the original data set into at least two sets (training and test).

For each task attempted, your assignment must contain the following:

1.      Describe the optimisations investigated (1 mark) and

2.      Describe and discuss the performance of the system on training and test data (2 marks).

-

1A

Neural Networks

Optimisations include the learning rate, the number of iterations of learning, and the size of the test/training sets.

3

1B

ID3 Decision Trees

Optimisations include the size of the test/training sets.

3

2

Advanced Study

You may complete one or both of the tasks 2A and 2B. It is important to note that Part 1A must have been attempted to be eligible for marks in 2A and that Part 1B must have been attempted to be eligible for marks in 2B.

Having completed the Basic Study, you can now extend the technique to produce a more advanced classifier. You are to modify the appropriate source code provided. You should optimise the performance of your advanced classifier.

For each task attempted, your assignment must contain the following:

1.      Describe how the basic implementation was extended (1 mark),

2.      Provide and describe pseudo-code for your extensions to the basic implementation (0.5 marks),

3.      Describe how the supplied code was altered, including a clear description of how it should be run (0.5 marks),

4.      Provide correct implementation of your extensions in Java (1 mark), and

5.      Describe and discuss the performance of the system on training and test data (2 marks).

-

2A

Multi-layer neural network

(See Russell and Norvig, p. 744-748.)

Add hidden units to the single-layer neural network to give it two layers of weights. Optimisations include the number of hidden units.

5

2B

ID3 decision tree with pruning

(See Russell and Norvig, p.662-663.)
Use any relevance-based heuristic to remove paths (“prune” decisions) in the decision tree. Optimisations include the cut-off used for pruning.

5

3

Compare Study

To be eligible for marks in Part 3 you must have attempted all previous parts (i.e. 1A, 1B, 2A, and 2B).

Your assignment must contain the following:

1.      Discuss whether Neural Networks or ID3 Decision Trees is easier to implement (0.5 marks),  and

2.      Discuss whether Neural Networks or ID3 Decision Trees performs better in your implementation (0.5 marks).

1

4

Refine Study

Any number of the following tasks (4A-4D) may be completed. All are assigned the same marks (4 marks each). It is important to note that the Parts 1 and 2 (i.e. [1A and 2A] OR [1B and 2B]) MUST have been attempted to be eligible for marks outlined in this section. You may attempt any task(s) of this section.

Your assignment must contain the following:

1.      Describe the method (1 mark),

2.      Describe your implementation (1 mark),

3.      Provide correct implementation of the method (1 mark), and

4.      Describe the performance on training and test data, especially in comparison to the basic study (1 mark).

4A

Create an Ensemble of Classifiers

Make multiple copies of your classifier, and train these copies on random subsets of your training data. Produce an overall prediction in response to a test pattern by combining the results from multiple classifiers via a voting system. See Russell and Norvig, section 18.4 on p. 664-668 for some background.

4

4B

Implement a pre-processing technique

Implement some pre-processing technique that is run on each data instance before it is presented to the classifier for training/testing and which you feel may help the classifier perform better. Ideas in this area may include centring the letter in its 32 * 32 bitmap square, emphasising curves or straight edges that appear in the instance, or even compressing the letter into a smaller bitmap area (e.g. 8 * 8). The choice here is up to you. 

4

4C

Overtraining Prevention

There are a number of techniques that may be used to ensure the classifier is not overtraining. One option is to partition your training set into two new sets (a training set, and a validation or verification set) and to use this validation set to estimate when to stop training the classifier. Again, the choice is up to you. See Russell and Norvig, p. 661-663 for some pointers.

4

4D

Additional Refinement

Implement some technical refinement that aims to improve the performance of the classifier and that has not been specified in 4A-C.

4

5

Discussion

Discuss how you could implement the basic and advanced study with another machine learning technique (i.e. not Neural Networks or Decision Trees). (1 mark)

Discuss other ways that you could improve the performance of the system (i.e. other than using an ensemble, implementing a pre-processing technique, and overtraining prevention). (1 mark)

You should refer to published literature in this section, appropriately referencing this work, including a list of references. (1 mark)

3

6

Competition

Enter your trained classifier in a class competition, to be tested on a new set of data (details given below).

+1 mark for entering the competition with a classifier that compiles, makes relevant use of a machine learning algorithm, and produces predictions for all test cases.

+1 mark for the best four submissions from the competition. This will largely be determined by classifier accuracy on the test set, but other factors will be considered if necessary.

2

-

Individual Work

1

Competition Details

You have the option of entering your classifier in a class competition, which will be run during the final lecture of the semester (Tuesday 27 October). Your classifier code will be tested on a new set of data to see the proportion of these examples which it correctly classifies. The results will also be placed on the course web page. This is not compulsory but is highly recommended (both for seeing the results of independent testing and for the possibility of extra marks).

To take part in the competition, submit your classifier code, named Classifier_XX.java (where XX is your student number(s)) and the serialized instance of this class, named classifier_XX.ser (the file you get when you use bitmap.Classifier.save), before 9am on Monday 26 October.  (Note that this is after the rest of the assignment is due.) Submission is via the ITEE online submission website found at http://submit.itee.uq.edu.au to the assignment competition folder.

All additional java classes should be physically contained within your java file (as inner/private classes). Your class must belong to the bitmap package. Moreover, your class must extend (i.e. be a subclass of) the original bitmap.LetterClassifier (if not, your serialized classifier file will not load). You can assume that you have access to all java classes and data files that are part of the distributed code. Make sure that the serialized file is generated by the same class definition as you submit. Even changing the class name will invalidate the serialized file.

The competition system will re-construct your classifier instance. Then each new letter is tested on your classifier by calling Classifier_XX.test(Bitmap map). The classifier needs to operate very quickly (much less than 1 second on a standard PC) and load within 3 seconds (on a standard PC). The program will be tested and the coordinator reserves the right to disqualify submissions (especially if they don't compile without modification). From a letter recognition point of view, the competition application will work precisely as bitmap.UseClassifier.java. In the competition your classifier will only be identified by the name you give it (in case you wish to remain anonymous in class). Please change the name of your classifier from the default “NN Classifier 1” or “ID3 Classifier 1”.

Only one classifier per group can be used in the competition (but it can be an ensemble). To prepare for a successful competition you should try a number of different configurations, evaluate them objectively, and select one (or a combination) of them. Note that you must make relevant use of a machine learning algorithm. It is not sufficient to come up with a smart program.

Participating in the competition will attract 1 bonus mark (code compiles, meets criteria and produces predictions for all test cases). The best four submissions from the competition will receive another bonus mark. This will largely be determined by classifier accuracy on the test set, but other factors will be considered if necessary.

Questions?

FAQ