COMP3702/COMP7702
Assignment 2: Handwritten
letter recognition
Due 5pm,
Friday, 23 October 2009.
This assignment counts 20% toward the final marks.
Purpose
This assignment
is intended to practically acquaint you with machine learning methodology and techniques.
Additionally, it should provide some understanding of how simple pattern
recognition tasks, such as handwritten letter/digit recognition (popular in
handheld and ubiquitous computing, and optical character recognition
applications), can be implemented.
Code and Data
The supporting material can be used freely in your work: zipped source code, package description, zipped data set.
The supporting code is in Java and we recommend that you to use and modify this to complete the assignment. If you do not know Java currently, you will not need to learn a great deal of Java to do the assignment, but working in a pair might be a good idea.
The data set of over 5000 characters was generated by students in a previous year using GenerateLetters.java. Seeing the working code may give you ideas (e.g. for preprocessing), but you do not need to generate a new data set.
(See also known issues with the code and data set)
Administration
The assignment is worth 20% of
your final grade and you may work individually or in pairs. Working
individually will attract an automatic 1 bonus mark. As with the first assignment,
there are more than the maximum marks available. You may attempt as many marks
as you wish (to create a “buffer” of marks), but your grade will be capped at
20 marks for this assignment, even if your raw score is over 20. You are
required to submit two separate components for this assignment, these are:
·
A
package file containing all the source code files (e.g. *.java), including any
you modify. Modified sections of source code should include comments
and overall, the code should compile. There is no excuse for handling in a program
which does not compile/interpret or run. If you are having trouble, some
features may be omitted. Package and submit all the source code
(*.java), including unmodified files, maintaining directory structure as
necessary. You may package the source code into e.g. a .zip or .jar file.
·
A pdf with the written component of the assignment (see template/example
for general guidance). Please indicate in your pdf
which source code files you modified and if there are any new files. Also
indicate in your pdf exactly what parts you have
attempted, in part or in full, so that we know what to look for when marking.
Submission of both parts is via
the ITEE online submission website found at http://submit.itee.uq.edu.au. Your group name should be the student
number of the person submitting the assignment. You should also use
your student number(s) within the pdf and code, and
as part of the pdf and packaged code filenames. Note: There will be marks
deducted if your PDF and code do not contain your names and Student IDs.
If you resubmit your
assignment, please resubmit all files relating to your solution. Your most
recent submission will be the only submission marked.
In line with University policy
collusion and plagiarism will not be tolerated; see the course profile for more
details on these topics.
Each group must submit an
assignment cover sheet signed by all members of the group. The online submission
system will offer you a printable cover sheet as you submit your assignment.
Otherwise, a suitable ITEE assignment cover sheet can be obtained from http://studenthelp.itee.uq.edu.au/assignments/
. You should drop your completed form into the COMP3702/7702 submission box on
level 1 of GP South as soon as possible after submitting your assignment.
You can discuss aspects of the
assignment with other members of the class on the comp3702 newsgroup (see my.uq.edu.au). If
you want to ask the lecturer any questions, sending an email to ruth@itee.uq.edu.au
is the best option.
The Assignment
You have been given a large dataset
which contains samples of “handwritten” letters. You are to investigate how
machine learning techniques create models to correctly classify instances of
handwritten letters.
Source code has been
provided which implement a simple version of each of these machine
learning techniques (see Code and Data above).
|
Part |
Description |
Marks |
|
1 |
Basic Study You may complete one or both of the tasks 1A and 1B Investigate and optimise the performance of the already implemented machine learning technique. Divide the original data set into at least two sets (training and test). For each task attempted, your assignment must contain the following: 1. Describe the optimisations investigated (1 mark) and 2. Describe and discuss the performance of the system on training and test data (2 marks). |
- |
|
1A |
Neural Networks Optimisations include the learning rate, the number of iterations of learning, and the size of the test/training sets. |
3 |
|
1B |
ID3 Decision Trees Optimisations include the size of the test/training sets. |
3 |
|
2 |
Advanced Study You may complete one or both of the tasks 2A and 2B. It is important to note that Part 1A must have been attempted to be eligible for marks in 2A and that Part 1B must have been attempted to be eligible for marks in 2B. Having completed the Basic Study, you can now extend the technique to produce a more advanced classifier. You are to modify the appropriate source code provided. You should optimise the performance of your advanced classifier. For each task attempted, your assignment must contain the following: 1. Describe how the basic implementation was extended (1 mark), 2. Provide and describe pseudo-code for your extensions to the basic implementation (0.5 marks), 3. Describe how the supplied code was altered, including a clear description of how it should be run (0.5 marks), 4. Provide correct implementation of your extensions in Java (1 mark), and 5. Describe and discuss the performance of the system on training and test data (2 marks). |
- |
|
2A |
Multi-layer neural network (See Russell and Norvig, p. 744-748.) Add hidden units to the single-layer neural network to give it two
layers of weights. Optimisations include the number of hidden units. |
5 |
|
2B |
ID3 decision tree with pruning (See Russell and Norvig, p.662-663.) |
5 |
|
3 |
Compare Study To be eligible for marks in Part 3 you must have attempted all previous parts (i.e. 1A, 1B, 2A, and 2B). Your assignment must contain the following: 1. Discuss whether Neural Networks or ID3 Decision Trees is easier to implement (0.5 marks), and 2. Discuss whether Neural Networks or ID3 Decision Trees performs better in your implementation (0.5 marks). |
1 |
|
4 |
Refine Study Any number of the following tasks (4A-4D) may be completed. All are assigned the same marks (4 marks each). It is important to note that the Parts 1 and 2 (i.e. [1A and 2A] OR [1B and 2B]) MUST have been attempted to be eligible for marks outlined in this section. You may attempt any task(s) of this section. Your assignment must contain the following: 1. Describe the method (1 mark), 2. Describe your implementation (1 mark), 3. Provide correct implementation of the method (1 mark), and 4. Describe the performance on training and test data, especially in comparison to the basic study (1 mark). |
– |
|
4A |
Create an Ensemble of
Classifiers Make multiple copies of your classifier, and train these copies on random
subsets of your training data. Produce an overall prediction in response to a
test pattern by combining the results from multiple classifiers via a voting
system. See Russell and Norvig, section 18.4 on p.
664-668 for some background. |
4 |
|
4B |
Implement a
pre-processing technique Implement some pre-processing technique that is run on each data instance
before it is presented to the classifier for training/testing and which you feel
may help the classifier perform better. Ideas in this area may include
centring the letter in its 32 * 32 bitmap square, emphasising curves or
straight edges that appear in the instance, or even compressing the letter
into a smaller bitmap area (e.g. 8 * 8). The choice here is up to you. |
4 |
|
4C |
Overtraining Prevention There are a number of techniques that may be used to ensure the classifier
is not overtraining. One option is to partition your training set into two
new sets (a training set, and a validation or verification set) and to use
this validation set to estimate when to stop training the classifier. Again,
the choice is up to you. See Russell and Norvig, p.
661-663 for some pointers. |
4 |
|
4D |
Additional Refinement Implement some technical refinement that aims to improve the performance
of the classifier and that has not been specified in 4A-C. |
4 |
|
5 |
Discussion Discuss how you could implement the basic and advanced study with another machine learning technique (i.e. not Neural Networks or Decision Trees). (1 mark) Discuss other ways that you could improve the performance of the system (i.e. other than using an ensemble, implementing a pre-processing technique, and overtraining prevention). (1 mark) You should refer to published literature in this section, appropriately
referencing this work, including a list of references. (1 mark) |
3 |
|
6 |
Competition Enter your trained classifier in
a class competition, to be tested on a new set of data (details given below). +1 mark for entering the competition with a classifier that compiles, makes relevant use of a machine learning algorithm, and produces predictions for all test cases. +1 mark for the best four submissions from the
competition. This will largely be determined by classifier accuracy on the
test set, but other factors will be considered if necessary. |
2 |
|
- |
Individual Work |
1 |
Competition Details
You have the option of entering
your classifier in a class competition, which will be run during the final
lecture of the semester (Tuesday 27 October). Your classifier code will be
tested on a new set of data to see the proportion of these examples
which it correctly classifies. The results will also be placed on the course
web page. This is not compulsory but is highly recommended (both
for seeing the results of independent testing and for the possibility of
extra marks).
To take part in the
competition, submit your classifier code, named Classifier_XX.java (where XX is
your student number(s)) and the serialized instance of this class, named
classifier_XX.ser (the file you get when you use bitmap.Classifier.save),
before 9am on Monday 26 October. (Note
that this is after the rest of the assignment is due.) Submission is via the
ITEE online submission website found at http://submit.itee.uq.edu.au
to the assignment competition folder.
All additional java classes
should be physically contained within your java file (as inner/private
classes). Your class must belong to the bitmap package. Moreover, your class
must extend (i.e. be a subclass of) the original bitmap.LetterClassifier
(if not, your serialized classifier file will not load). You can assume that
you have access to all java classes and data files that are part of the
distributed code. Make sure that the serialized file is generated by the same
class definition as you submit. Even changing the class name will invalidate
the serialized file.
The competition system will
re-construct your classifier instance. Then each new letter is tested on your
classifier by calling Classifier_XX.test(Bitmap map). The classifier needs to operate very quickly
(much less than 1 second on a standard PC) and load within 3 seconds (on a
standard PC). The program will be tested and the coordinator reserves the right
to disqualify submissions (especially if they don't compile without
modification). From a letter recognition point of view, the competition
application will work precisely as bitmap.UseClassifier.java.
In the competition your classifier will only be identified by the name you give
it (in case you wish to remain anonymous in class). Please change the name of your classifier from the default “NN
Classifier 1” or “ID3 Classifier 1”.
Only one classifier per group
can be used in the competition (but it can be an ensemble). To prepare for a
successful competition you should try a number of
different configurations, evaluate them objectively, and select one (or a
combination) of them. Note that you must make relevant use of a machine
learning algorithm. It is not sufficient to come up with a smart program.
Participating in the
competition will attract 1 bonus mark (code compiles, meets criteria and
produces predictions for all test cases). The best four submissions from
the competition will receive another bonus mark. This will largely be
determined by classifier accuracy on the test set, but other factors will be
considered if necessary.
