Saif Mohammad's Homepage

Software and Data (Masters)

Saif Mohammad
Advisor: Dr. Ted Pedersen

WORD SENSE DISAMBIGUATION
A word sense disambiguation system that uses WEKA's C4.5 decision tree learning algorithm with lexical and syntactic features, both individually and in combination. The system can disambiguate any data in Senseval-2 data format.
SyntaLex   README
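As an illustration of what "Senseval-2 data format" means in practice, here is a minimal sketch that reads sense-tagged instances from a small, made-up sample. It is not part of SyntaLex. The sample text, the sense labels and the function name read_instances are assumptions for illustration, and the sketch assumes the input is well-formed XML with inline answer tags, which real Senseval files may not always be.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-instance sample laid out like Senseval-2 lexical sample data.
SAMPLE = """<corpus lang="english">
<lexelt item="line.n">
<instance id="line.1">
<answer instance="line.1" senseid="phone"/>
<context>He picked up the <head>line</head> and dialed .</context>
</instance>
<instance id="line.2">
<answer instance="line.2" senseid="text"/>
<context>She read the first <head>line</head> of the poem .</context>
</instance>
</lexelt>
</corpus>"""


def read_instances(xml_text):
    """Return one dict per instance: target item, id, sense, head word and context."""
    root = ET.fromstring(xml_text)
    rows = []
    for lexelt in root.iter("lexelt"):
        for inst in lexelt.iter("instance"):
            answer = inst.find("answer")
            context = inst.find("context")
            head = context.find("head")
            rows.append({
                "item": lexelt.get("item"),
                "id": inst.get("id"),
                # Test data may carry no answer tag, so the sense can be None.
                "sense": answer.get("senseid") if answer is not None else None,
                "head": head.text,
                # Text before <head> is context.text; text after it is head.tail.
                "left": (context.text or "").split(),
                "right": (head.tail or "").split(),
            })
    return rows


if __name__ == "__main__":
    for row in read_instances(SAMPLE):
        print(row["id"], row["sense"], row["left"][-2:], row["head"], row["right"][:2])
```

The left and right token lists are the kind of raw material from which lexical features (for example, surrounding words) could be derived before training a classifier.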

PART-OF-SPEECH TAGGED, PARSED AND SENSE-TAGGED DATA IN SENSEVAL-2 DATA FORMAT
Senseval-3 Senseval-2 Senseval-1   line   hard   serve   interest

SENSE-TAGGED DATA IN SENSEVAL-2 DATA FORMAT
This is the data used to train and evaluate the word sense disambiguation system.
The Senseval-2 (a small sample) and Senseval-1 (a small sample) Lexical Sample data are available at the Senseval webpage.
Senseval-1 data has certain erroneous sense tags (view sample). The cleaned-up data: Csenseval1 sample README (May 15, 2002)
Corrected Senseval-1 data in Senseval-2 data format: CSenseval1in2 sample README (May 15, 2002)
Packages to convert LINE, HARD, SERVE and INTEREST data to Senseval-1 and Senseval-2 data formats:

LineOneTwo README (Jan, 2003)   HardOneTwo README (Jan, 2003)   ServeOneTwo README (Jan, 2003)   InterestOneTwo README (Jan, 2003)

Line data (a small sample) in:   Senseval-1 data format: Line-S1 sample   Senseval-2 data format: Line-S2 sample
Hard data (a small sample) in:   Senseval-1 data format: Hard-S1 sample   Senseval-2 data format: Hard-S2 sample
Serve data (a small sample) in:   Senseval-1 data format: Serve-S1 sample   Senseval-2 data format: Serve-S2 sample
Interest data (a small sample) in:   Senseval-1 data format: Interest-S1 sample   Senseval-2 data format: Interest-S2 sample

PART-OF-SPEECH TAGGING
Parts of speech may be assigned to any data in Senseval-2 data format using the package posSenseval README (Feb 23, 2003). The Brill Tagger is used to part-of-speech tag the data.
When the part of speech of certain words in the data is already known, tagging accuracy may be improved by pre-tagging these words with their correct part of speech, thereby providing anchor points around which the remaining words may be tagged more reliably. A patch to the Brill Tagger that guarantees pre-tagging and also resolves a problem in the existing pre-tagging may be downloaded from here: BrillPatch README (Feb, 2003).
The details of this work can be found in the recently accepted paper, "Guaranteed Pre-Tagging for the Brill Tagger" (ABSTRACT). The paper is to appear in the proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, CICLing-2003, February 2003, in Mexico City.
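The pre-tagging idea can be sketched as follows: words whose part of speech is already known are marked in the tagger's input so that they act as anchor points. This is only an illustration, not the posSenseval or BrillPatch code. The function name pretag is hypothetical, and the "//" separator between word and tag is an assumption about the Brill Tagger's pre-tagging markup that should be checked against the BrillPatch README.

```python
def pretag(tokens, known_tags, sep="//"):
    """Join tokens into one input line, attaching known tags to their words.

    tokens     -- list of word tokens for one sentence
    known_tags -- dict mapping token index to its known part-of-speech tag
    sep        -- assumed separator between a word and its pre-assigned tag
    """
    marked = []
    for i, token in enumerate(tokens):
        tag = known_tags.get(i)
        # Only words with a known tag are pre-tagged; the rest are left bare
        # for the tagger to decide.
        marked.append(f"{token}{sep}{tag}" if tag else token)
    return " ".join(marked)


if __name__ == "__main__":
    sentence = ["He", "picked", "up", "the", "line", "."]
    # The target word of a "line.n" instance is known to be a noun.
    print(pretag(sentence, {4: "NN"}))
```

In sense-tagged data the target word's part of speech is usually known from the lexical item (for example, a noun for a ".n" item), which makes the head word a natural candidate for pre-tagging.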
Part-of-speech tagged data in Senseval-2 data format (Feb 23, 2003):
Senseval-2 test data: test-S2.pos sample   Senseval-2 training data: train-S2.pos sample
Corrected Senseval-1 test data: test-S1.pos sample   Corrected Senseval-1 training data: train-S1.pos sample
Line data: line.pos sample   Hard data: hard.pos sample   Serve data: serve.pos sample   Interest data: interest.pos sample

Last updated: Feb. 2005