edu.stanford.nlp.parser.lexparser
Class ChineseCharacterBasedLexicon

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.ChineseCharacterBasedLexicon
All Implemented Interfaces:
Lexicon, java.io.Serializable

public class ChineseCharacterBasedLexicon
extends java.lang.Object
implements Lexicon

Author:
Galen Andrew
See Also:
Serialized Form

Field Summary
 
Fields inherited from interface edu.stanford.nlp.parser.lexparser.Lexicon
BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD
 
Constructor Summary
ChineseCharacterBasedLexicon(ChineseTreebankParserParams params, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex)
           
 
Method Summary
 void finishTraining()
          Done collecting statistics for the lexicon.
 Distribution<java.lang.String> getPOSDistribution()
           
 UnknownWordModel getUnknownWordModel()
           
 void incrementTreesRead(double weight)
          If training on a per-word basis instead of on a per-tree basis, call this to increment the tree count as each tree's words are processed.
 void initializeTraining(double numTrees)
          Start training this lexicon on the expected number of trees.
 static boolean isForeign(java.lang.String s)
           
 boolean isKnown(int word)
          Checks whether a word is in the lexicon.
 boolean isKnown(java.lang.String word)
          Checks whether a word is in the lexicon.
 int numRules()
          Returns the number of rules (tag rewrites as word) in the Lexicon.
 void readData(java.io.BufferedReader in)
          Read the lexicon from the BufferedReader in the format written by writeData.
 java.util.Iterator<IntTaggedWord> ruleIteratorByWord(int word, int loc, java.lang.String featureSpec)
          Get an iterator over all rules (pairs of (word, POS)) for this word.
 java.util.Iterator<IntTaggedWord> ruleIteratorByWord(java.lang.String word, int loc, java.lang.String featureSpec)
          Same as ruleIteratorByWord(int, int, String), but takes the word as a String, which is first translated to an integer by the lexicon's word index.
 java.lang.String sampleFrom()
          Samples over words regardless of POS: first samples a POS, then samples a word according to that POS.
 java.lang.String sampleFrom(java.lang.String tag)
          Samples from the distribution over words with this POS according to the lexicon.
 float score(IntTaggedWord iTW, int loc, java.lang.String word, java.lang.String featureSpec)
          Get the score of this word with this tag (as an IntTaggedWord) at this loc.
 void setUnknownWordModel(UnknownWordModel uwm)
           
 void train(java.util.Collection<Tree> trees)
          Train this lexicon on the given set of trees.
 void train(java.util.Collection<Tree> trees, java.util.Collection<Tree> rawTrees)
           
 void train(java.util.Collection<Tree> trees, double weight)
          Train this lexicon on the given set of trees.
 void train(java.util.List<TaggedWord> sentence, double weight)
          Not all subclasses support this particular method.
 void train(TaggedWord tw, int loc, double weight)
          Not all subclasses support this particular method.
 void train(Tree tree, double weight)
          Trains this lexicon on a single tree. TODO: make this method do something with the weight, which is not yet used.
 void trainUnannotated(java.util.List<TaggedWord> sentence, double weight)
          Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree.
 void writeData(java.io.Writer w)
          Write the lexicon in human-readable format to the Writer.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ChineseCharacterBasedLexicon

public ChineseCharacterBasedLexicon(ChineseTreebankParserParams params,
                                    Index<java.lang.String> wordIndex,
                                    Index<java.lang.String> tagIndex)
Method Detail

initializeTraining

public void initializeTraining(double numTrees)
Description copied from interface: Lexicon
Start training this lexicon on the expected number of trees. (Some UnknownWordModels use the number of trees to know when to start counting statistics.)

Specified by:
initializeTraining in interface Lexicon

train

public void train(java.util.Collection<Tree> trees)
Train this lexicon on the given set of trees.

Specified by:
train in interface Lexicon
Parameters:
trees - Trees to train on

train

public void train(java.util.Collection<Tree> trees,
                  double weight)
Train this lexicon on the given set of trees.

Specified by:
train in interface Lexicon

train

public void train(Tree tree,
                  double weight)
Trains this lexicon on a single tree. TODO: make this method do something with the weight, which is not yet used.

Specified by:
train in interface Lexicon

trainUnannotated

public void trainUnannotated(java.util.List<TaggedWord> sentence,
                             double weight)
Description copied from interface: Lexicon
Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree.

Specified by:
trainUnannotated in interface Lexicon

incrementTreesRead

public void incrementTreesRead(double weight)
Description copied from interface: Lexicon
If training on a per-word basis instead of on a per-tree basis, call this to increment the tree count as each tree's words are processed.

Specified by:
incrementTreesRead in interface Lexicon

train

public void train(TaggedWord tw,
                  int loc,
                  double weight)
Description copied from interface: Lexicon
Not all subclasses support this particular method. Those that don't will fail when it is called.

Specified by:
train in interface Lexicon

train

public void train(java.util.List<TaggedWord> sentence,
                  double weight)
Description copied from interface: Lexicon
Not all subclasses support this particular method. Those that don't will fail when it is called.

Specified by:
train in interface Lexicon

finishTraining

public void finishTraining()
Description copied from interface: Lexicon
Done collecting statistics for the lexicon.

Specified by:
finishTraining in interface Lexicon
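
The training methods above suggest a call order of initializeTraining, one or more train calls, then finishTraining. The following is a minimal sketch of that lifecycle, not the actual implementation: TrainingLifecycleSketch is a hypothetical stand-in class, tagged words are represented as plain "word/TAG" strings rather than TaggedWord objects, and the counting scheme is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a lexicon; NOT the Stanford Lexicon interface.
public class TrainingLifecycleSketch {
    private final Map<String, Double> counts = new HashMap<>();
    private boolean training = false;
    private double expectedTrees = 0.0; // stored only; unused in this sketch

    // Plays the role of initializeTraining(double numTrees).
    public void initializeTraining(double numTrees) {
        expectedTrees = numTrees;
        counts.clear();
        training = true;
    }

    // Plays the role of train(List<TaggedWord>, double); items are "word/TAG".
    public void train(List<String> taggedWords, double weight) {
        if (!training) {
            throw new IllegalStateException("call initializeTraining first");
        }
        for (String tw : taggedWords) {
            counts.merge(tw, weight, Double::sum); // add the weight to the count
        }
    }

    // Plays the role of finishTraining(): no more statistics are collected.
    public void finishTraining() {
        training = false;
    }

    public double count(String taggedWord) {
        return counts.getOrDefault(taggedWord, 0.0);
    }

    public static void main(String[] args) {
        TrainingLifecycleSketch lex = new TrainingLifecycleSketch();
        lex.initializeTraining(2.0);
        lex.train(Arrays.asList("gou/NN", "pao/VV"), 1.0);
        lex.train(Arrays.asList("gou/NN"), 0.5);
        lex.finishTraining();
        System.out.println(lex.count("gou/NN")); // prints 1.5
    }
}
```

In this sketch, calling train before initializeTraining throws an IllegalStateException; whether the real class enforces any such ordering is not stated on this page.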

getPOSDistribution

public Distribution<java.lang.String> getPOSDistribution()

isForeign

public static boolean isForeign(java.lang.String s)

score

public float score(IntTaggedWord iTW,
                   int loc,
                   java.lang.String word,
                   java.lang.String featureSpec)
Description copied from interface: Lexicon
Get the score of this word with this tag (as an IntTaggedWord) at this loc. (Presumably an estimate of P(word | tag).)

Specified by:
score in interface Lexicon
Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial.
word - The word itself; useful so we don't have to look it up in an index
featureSpec - Additional word features like morphosyntactic information (as described for ruleIteratorByWord).
Returns:
A score, usually log P(word|tag)
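
To make the "log P(word|tag)" return value concrete, here is a minimal sketch that computes it as a plain relative-frequency estimate from counts. This is an assumption for illustration only: LogProbScoreSketch, observe, and the string-keyed maps are hypothetical, and the real class is character-based and may smooth or model unknown words quite differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a log P(word | tag) score; not the real model.
public class LogProbScoreSketch {
    private final Map<String, Double> tagCounts = new HashMap<>();
    private final Map<String, Double> wordTagCounts = new HashMap<>();

    // Record one weighted observation of (word, tag).
    public void observe(String word, String tag, double weight) {
        tagCounts.merge(tag, weight, Double::sum);
        wordTagCounts.merge(word + "\t" + tag, weight, Double::sum);
    }

    // Returns log(count(word, tag) / count(tag)),
    // or negative infinity when the pair was never observed.
    public float score(String word, String tag) {
        double joint = wordTagCounts.getOrDefault(word + "\t" + tag, 0.0);
        double total = tagCounts.getOrDefault(tag, 0.0);
        if (joint == 0.0 || total == 0.0) {
            return Float.NEGATIVE_INFINITY;
        }
        return (float) Math.log(joint / total);
    }
}
```

Because the score is a natural-log probability, it is at most 0, and scores for independent events can be added instead of multiplying probabilities.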

sampleFrom

public java.lang.String sampleFrom(java.lang.String tag)
Samples from the distribution over words with this POS according to the lexicon.

Parameters:
tag - the POS of the word to sample
Returns:
a sampled word

sampleFrom

public java.lang.String sampleFrom()
Samples over words regardless of POS: first samples a POS, then samples a word according to that POS.

Returns:
a sampled word
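
The two-stage procedure described for sampleFrom() (sample a POS, then sample a word given that POS) can be sketched as below. This is a generic illustration under stated assumptions, not the class's code: TwoStageSamplerSketch and its weight maps are hypothetical, and weights are assumed to be positive.

```java
import java.util.Map;
import java.util.Random;

// Hypothetical illustration of "sample a tag, then sample a word for that tag".
public class TwoStageSamplerSketch {

    // Draws a key with probability proportional to its (positive) weight.
    public static String sample(Map<String, Double> weights, Random rng) {
        double total = 0.0;
        for (double w : weights.values()) {
            total += w;
        }
        double r = rng.nextDouble() * total;
        String last = null;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            last = e.getKey();
            r -= e.getValue();
            if (r < 0.0) {
                return last;
            }
        }
        return last; // fallback for floating-point round-off
    }

    // Stage 1: sample a tag. Stage 2: sample a word from that tag's weights.
    public static String sampleWord(Map<String, Double> tagWeights,
                                    Map<String, Map<String, Double>> wordWeightsByTag,
                                    Random rng) {
        String tag = sample(tagWeights, rng);
        return sample(wordWeightsByTag.get(tag), rng);
    }
}
```

Passing a seeded Random makes draws reproducible, which is useful when comparing sampled output across runs.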

ruleIteratorByWord

public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(int word,
                                                            int loc,
                                                            java.lang.String featureSpec)
Description copied from interface: Lexicon
Get an iterator over all rules (pairs of (word, POS)) for this word.

Specified by:
ruleIteratorByWord in interface Lexicon
Parameters:
word - The word, represented as an integer in Index
loc - The position of the word in the sentence (counting from 0). Implementation note: The BaseLexicon class doesn't actually make use of this position information.
featureSpec - Additional word features like morphosyntactic information.
Returns:
An Iterator over a List of IntTaggedWords, which pair the word with possible taggings as integer pairs. (Each can be thought of as a tag -> word rule.)

ruleIteratorByWord

public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(java.lang.String word,
                                                            int loc,
                                                            java.lang.String featureSpec)
Description copied from interface: Lexicon
Same as ruleIteratorByWord(int, int, String), but takes the word as a String, which is first translated to an integer by the lexicon's word index.

Specified by:
ruleIteratorByWord in interface Lexicon
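
The relationship between the String and int overloads can be pictured as simple delegation through a word index. The sketch below is hypothetical throughout: WordIndexSketch is a toy index, and describe stands in for a lookup such as ruleIteratorByWord; it does not reproduce the library's Index class or its behavior for unseen words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy word index with a String overload that delegates to an int one.
public class WordIndexSketch {
    private final Map<String, Integer> toId = new HashMap<>();
    private final List<String> toWord = new ArrayList<>();

    // Returns the id of the word, adding it if requested; -1 if absent.
    public int indexOf(String word, boolean add) {
        Integer id = toId.get(word);
        if (id != null) {
            return id;
        }
        if (!add) {
            return -1;
        }
        toId.put(word, toWord.size());
        toWord.add(word);
        return toWord.size() - 1;
    }

    // Int-keyed lookup: stands in for the int overload.
    public String describe(int word) {
        return "rules for word #" + word;
    }

    // String overload: translate through the index, then delegate.
    public String describe(String word) {
        return describe(indexOf(word, true));
    }
}
```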

numRules

public int numRules()
Returns the number of rules (tag rewrites as word) in the Lexicon. This method is not yet implemented in this class: it currently always returns 0, so callers should not rely on the value.

Specified by:
numRules in interface Lexicon
Returns:
The number of rules (tag rewrites as word) in the Lexicon.

readData

public void readData(java.io.BufferedReader in)
              throws java.io.IOException
Description copied from interface: Lexicon
Read the lexicon from the BufferedReader in the format written by writeData. (An optional operation.)

Specified by:
readData in interface Lexicon
Parameters:
in - The BufferedReader to read from
Throws:
java.io.IOException - If an I/O problem occurs

writeData

public void writeData(java.io.Writer w)
               throws java.io.IOException
Description copied from interface: Lexicon
Write the lexicon in human-readable format to the Writer. (An optional operation.)

Specified by:
writeData in interface Lexicon
Parameters:
w - The writer to output to
Throws:
java.io.IOException - If an I/O problem occurs
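
Since readData is specified to read the format written by writeData, the two are expected to round-trip. The sketch below shows that contract with an invented format; the tab-separated "word/TAG, count" lines and the LexiconRoundTripSketch class are assumptions for illustration, and the actual on-disk format of this class is not documented on this page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical round trip; the line format here is invented for illustration.
public class LexiconRoundTripSketch {
    private final Map<String, Double> counts = new TreeMap<>();

    public void put(String taggedWord, double count) {
        counts.put(taggedWord, count);
    }

    public double get(String taggedWord) {
        return counts.getOrDefault(taggedWord, 0.0);
    }

    // Writes one "taggedWord<TAB>count" line per entry.
    public void writeData(Writer w) throws IOException {
        for (Map.Entry<String, Double> e : counts.entrySet()) {
            w.write(e.getKey() + "\t" + e.getValue() + "\n");
        }
        w.flush();
    }

    // Reads the lines produced by writeData, replacing the current contents.
    public void readData(BufferedReader in) throws IOException {
        counts.clear();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                continue;
            }
            int tab = line.lastIndexOf('\t');
            counts.put(line.substring(0, tab),
                       Double.parseDouble(line.substring(tab + 1)));
        }
    }

    // Serializes src to a string and loads it into a fresh instance.
    public static LexiconRoundTripSketch roundTrip(LexiconRoundTripSketch src) {
        try {
            StringWriter buf = new StringWriter();
            src.writeData(buf);
            LexiconRoundTripSketch copy = new LexiconRoundTripSketch();
            copy.readData(new BufferedReader(new StringReader(buf.toString())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both operations are marked optional in the interface, so a caller of the real class should be prepared for either one to be unsupported.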

isKnown

public boolean isKnown(int word)
Description copied from interface: Lexicon
Checks whether a word is in the lexicon.

Specified by:
isKnown in interface Lexicon
Parameters:
word - The word as an int
Returns:
Whether the word is in the lexicon

isKnown

public boolean isKnown(java.lang.String word)
Description copied from interface: Lexicon
Checks whether a word is in the lexicon.

Specified by:
isKnown in interface Lexicon
Parameters:
word - The word as a String
Returns:
Whether the word is in the lexicon

getUnknownWordModel

public UnknownWordModel getUnknownWordModel()
Specified by:
getUnknownWordModel in interface Lexicon

setUnknownWordModel

public void setUnknownWordModel(UnknownWordModel uwm)
Specified by:
setUnknownWordModel in interface Lexicon

train

public void train(java.util.Collection<Tree> trees,
                  java.util.Collection<Tree> rawTrees)
Specified by:
train in interface Lexicon


Stanford NLP Group