6. Conclusions


The main strength of the system described in chapters 4 and 5, as with other structural pattern recognition systems, lies in its flexibility. The system is, in effect, omnifont and size-independent, and its significant tolerance of noise, limited rotation, broken print and distortion is a major asset. The manipulation of areas of black pixels (sections) as entities in themselves enables fast processing. The only pixel-level operations are carried out in the first pass over the data, and even these could be undertaken using hardware already incorporated in an appropriate scanner. In addition, parallel processing techniques could be employed by breaking up the image and allocating each portion to its own processor for concurrent production of run-length encoding, formation of the transformed LAG, or both. All routines are implemented in 'C' under the UNIX operating system and run on both a Gould minicomputer and a Hewlett-Packard 300 series desktop workstation, illustrating their portability.
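

To illustrate the nature of this first pass, a minimal sketch of run-length encoding for a single image row is given below. It is written in 'C', as are the system routines, but the names used (struct run, encode_row) and the bit-per-pixel packing assumed are illustrative rather than taken from the system itself. The caller is assumed to supply a 'runs' array large enough for the row.

    struct run {
        int start;    /* column of the first black pixel in the run */
        int length;   /* number of consecutive black pixels         */
    };

    /* Scan one row of 'w' pixels packed eight to a byte (most
     * significant bit first); record each run of black pixels in
     * 'runs' and return the number of runs found. */
    int encode_row(const unsigned char *row, int w, struct run *runs)
    {
        int n = 0, col, in_run = 0;

        for (col = 0; col < w; col++) {
            int black = (row[col >> 3] >> (7 - (col & 7))) & 1;
            if (black && !in_run) {          /* a run begins    */
                runs[n].start = col;
                runs[n].length = 1;
                in_run = 1;
            } else if (black) {              /* a run continues */
                runs[n].length++;
            } else if (in_run) {             /* a run ends      */
                n++;
                in_run = 0;
            }
        }
        if (in_run)                          /* run meets the row edge */
            n++;
        return n;
    }

Since each row is processed independently, such a routine could equally be applied to separate strips of the image by separate processors, as suggested above.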




Further comment needs to be made with regard to run-times and memory requirements. Any form of processing of full A4-page images at 300 d.p.i., even binary images stored using a bit-per-pixel representation, requires significant amounts of memory. Unfortunately, neither of the above machines provided more than 1.5 to 2 Mbytes of user workspace. Given that a single image may require approximately 1 Mbyte, and that storage of the data structures representing sections and objects could increase this requirement to between six and ten Mbytes, it can be appreciated that problems were encountered due to page faults when processing large images. In the case of the extract shown in figure 4.10, however, total processing time was approximately 30s, from reading the image from disk to producing the text interpretation file.
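

For reference, the approximate 1 Mbyte figure quoted above follows directly from the page geometry: an A4 page measures 210 mm by 297 mm (roughly 8.27 in by 11.69 in), so scanning at 300 d.p.i. yields a grid of about 2480 by 3508 pixels, some 8.7 million pixels in all, which at one bit per pixel amounts to approximately 1.09 million bytes, i.e. just over 1 Mbyte.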




The processing of the image of figure 5.17 through to production of the SCORE M.R.L. data file shown in figure 5.19 took approximately 45s. The increase in processing time is roughly in proportion to the area of the score image involved, although the limitations of the development environment mentioned above restricted the testing of this. The production of run-length encoding and section data accounts for a significant proportion of the total processing time, but the run-times of procedures such as stave finding obviously depend on the musical content of the particular image involved.




The structural approach, by its nature, allows for both 'character' and 'varying graphical parameter' types of symbol. As with Prerau's efforts, an easily expandable program was the aim, and this has for the most part been achieved. The overall structure of the software was kept modular so that separate routines could be modified, replaced, optimized or expanded. The system is a real one in Mahoney's terms, in that it tackles 'real world' problems, but it still needs expanding to cope with a wider range of symbols and score formats. As mentioned in section 3.3.2, a trade-off exists between robustness and simplicity of the recognition task. Processing stavelines is a prime example. If assumptions are made that the lines are horizontal or, perhaps, as Roach and Tatem specified, that they extend across a large proportion of the width of the page, then their detection becomes far easier. In 'real world' situations, however, as Prerau noted, the lines cannot be assumed to be exactly parallel, horizontal, equidistant, of constant thickness or straight. Also, stavelines may be obscured to a significant extent by multiple beams, particularly where these are horizontal. In the general case, all these factors combine to invalidate the use of standard image processing techniques for locating stavelines, a fact corroborated by Roach and Tatem. The system was developed with these facts in mind.




As outlined in sections 4.10 and 4.11, the current methods could be extended to include all symbols which contain vertically-orientated linear components, i.e. individual notes with stems, all beamed groups, barlines, quaver (and shorter duration) rests, sharp, natural, flat and double flat signs, boxes (surrounding bar numbers or text) and brackets. Also included would be those symbols which can be modelled using a graph structure-based representation of the simplified original transformed LAG, in conjunction with the appropriate parameters for describing the dimensions and spatial organisation of the constituent sections. These two techniques operating in conjunction should cover the majority of conventional music symbols. It is anticipated that difficulties will be encountered where, for example, grace notes or other small symbols cannot be consistently analysed by structural breakdown due to the limits on print quality and scanning resolution.
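

By way of illustration, the following 'C' fragment sketches the kind of node such a graph structure-based representation might use; the field names and the fixed bound on junction degree are assumptions made for the sketch, not details of the system's own data structures.

    #define MAX_ADJ 8   /* assumed upper bound on junction degree */

    struct section_node {
        int left, right;          /* horizontal extent of the section  */
        int top, bottom;          /* vertical extent of the section    */
        float mean_thickness;     /* average run length in the section */
        int is_filament;          /* set when a thinness test marks
                                   * the section as a filament         */
        int n_prev, n_next;                  /* junction degrees       */
        struct section_node *prev[MAX_ADJ];  /* sections adjacent at
                                              * one end of the section */
        struct section_node *next[MAX_ADJ];  /* sections adjacent at
                                              * the other end          */
    };

Matching a symbol model would then amount to testing the pattern of adjacencies in conjunction with the dimensional and spatial parameters stored at each node.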




Some areas have been identified which may prove problematical in the context of future work, and possible solutions or approaches have been suggested. The difficulties resulting from fragmentation of symbols due to poor print quality have been circumvented to a significant extent by the techniques presented in chapters four and five. Severe break-up of symbols will, however, continue to be a problem for a topology-based approach and will probably necessitate the use of artificial intelligence-based techniques in order to take advantage of higher-level musical information. Similarly, suggestions have been made regarding the processing of overlapping symbols, but with the large number of possible combinations of symbols which are permitted to intersect, it would be unrealistic to expect any single method to cope with all situations. A method based on isolating characteristic sections within a composite object, and using these as the basis of an iterative process for separating out the merged symbols, has been proposed as a limited solution.




The main areas of concern pertaining to more widespread use of the existing techniques have been mentioned in chapter five. These are symbols which remain attached after staveline removal, removal of symbol components by the staveline identification process, and residual staveline fragments. The first of these can, perhaps, be categorised with the question of overlapping symbols, as it may be a product of the original engraving rather than the result of the staveline identification technique. Otherwise, the problems all seem related to the initial symbol isolation, for if this is not perfect, there will always be a need for compromise methods which compensate for the weaknesses of the staveline-finding technique. As discussed in chapter five, the present approach to staveline-finding may, in the future, be supplemented by a clustering-based technique. This should give better symbol isolation and hence further simplify the recognition task.




From the start of the research, it was acknowledged that the text present in music images would be treated separately. After isolation, recognition of the text could be achieved either by using existing techniques, possibly modified to use a dictionary (in processing Italian terms and the like), or by making use of a new method aimed at identifying not just the text but also its font. It was hoped that the segmentation techniques developed for processing the music symbols could be applied to some degree when dealing with text, and this may prove possible in the case of the orthogonal LAGs.




The intention was to place no restriction on the symbols which could eventually be included. Similarly, due to the way processing was carried out, no limit was placed on the formats which could be read. Chords and multiple voices on a single stave have not been excluded by the approach which has been taken, and it is hoped that more complex music including these features will be fully encompassed as part of future research. Originally, a target processing time of five to ten minutes was set for a complete A4 page of music of 'average' complexity, and the results so far achieved indicate that this will be met. The use of run-length encoding as the basis of the segmentation process also satisfies a desire for data compression and consequently opens up the possibility of using a scanner with this facility built into its hardware (usually in connection with Group III/IV facsimile transmission), as mentioned in section 4.3.




Several of the systems described in section 3.3 share a weakness: the line thickness threshold used in staveline-pixel allocation is fixed before processing commences. The author's system avoids this by deriving the line thickness threshold from the thickness of all the filaments found in the image. Importantly, the threshold is applied to complete sections as an average thickness measure, and the filaments themselves are located using a purely relative measure, namely aspect ratio.
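

The scheme can be summarised in a short 'C' sketch, reusing the illustrative section_node structure given earlier in this chapter; the aspect-ratio cut-off and the margin applied to the mean filament thickness are assumed values, chosen only to make the example concrete.

    #define ASPECT_MIN 5.0f   /* assumed length-to-thickness ratio
                               * above which a section is a filament */

    /* Classify filaments by aspect ratio alone, then derive the line
     * thickness threshold from their average thickness. */
    float thickness_threshold(struct section_node *s, int n_sections)
    {
        float total = 0.0f;
        int i, n_filaments = 0;

        for (i = 0; i < n_sections; i++) {
            int w = s[i].right - s[i].left + 1;
            int h = s[i].bottom - s[i].top + 1;
            float length = (float)(w > h ? w : h);

            if (s[i].mean_thickness > 0.0f &&
                length / s[i].mean_thickness >= ASPECT_MIN) {
                s[i].is_filament = 1;
                total += s[i].mean_thickness;
                n_filaments++;
            }
        }
        /* An assumed margin of 50% above the mean filament thickness;
         * the threshold is then compared against the average thickness
         * of complete sections. */
        return n_filaments ? 1.5f * (total / n_filaments) : 0.0f;
    }

Because both the filament test and the resulting threshold are derived from the image itself, the scheme adapts automatically to variations in scanning resolution and engraving style.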








