Enter a sentence in Dutch, English, German, or French (auto-detected). The sentence will be parsed and the most probable parse tree will be shown.
The Data-Oriented Parsing (DOP) framework constructs analyses of new sentences from fragments of previously seen analyses. Double-DOP operationalizes this by restricting the model to fragments that occur at least twice in the training data. The model in this demo includes discontinuous constituents, which are parsed with Linear Context-Free Rewriting Systems (LCFRS). For efficiency, sentences are parsed with the following coarse-to-fine pipeline:
- Split-PCFG (prune items with posterior probability < 1e-5)
- PLCFRS (prune items not in 50-best derivations)
- Discontinuous Double-DOP (use 1000-best derivations to approximate most probable parse)
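The two pruning steps of this pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the disco-dop implementation: chart items are treated as opaque hashable values, and the function names `prune_by_posterior` and `whitelist_from_kbest` are hypothetical. Only the thresholds (1e-5 and 50) come from the list above.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

# Hypothetical representation: a chart item is any hashable value,
# e.g. a (label, span) pair. A derivation is a (probability, items) pair.
Item = Hashable
Derivation = Tuple[float, List[Item]]


def prune_by_posterior(posteriors: Dict[Item, float],
                       threshold: float = 1e-5) -> Set[Item]:
    """Keep the items whose posterior probability is at least `threshold`.

    Items below the threshold are pruned and are not considered by the
    next, finer stage.
    """
    return {item for item, prob in posteriors.items() if prob >= threshold}


def whitelist_from_kbest(derivations: Iterable[Derivation],
                         k: int = 50) -> Set[Item]:
    """Collect the items that occur in the k most probable derivations.

    Items that occur in none of these derivations are pruned.
    """
    best = sorted(derivations, key=lambda d: d[0], reverse=True)[:k]
    return {item for _, items in best for item in items}


# Toy data, invented for illustration only.
posteriors = {("NP", 0, 2): 0.9, ("VP", 2, 5): 1e-7, ("S", 0, 5): 1.0}
coarse_whitelist = prune_by_posterior(posteriors)

derivations = [(0.6, ["a", "b"]), (0.3, ["a", "c"]), (0.1, ["d"])]
fine_whitelist = whitelist_from_kbest(derivations, k=2)
```

In a coarse-to-fine setup of this kind, each finer stage only builds items that have a counterpart in the whitelist produced by the previous stage. How items of one grammar map to items of the next is not covered by this sketch.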
Training data:
- English: WSJ section of Penn treebank
- German: Tiger treebank
- Dutch: Lassy, Alpino, Corpus Gesproken Nederlands (CGN)
- French: French treebank
Objective functions:
- MPP: most probable parse
- MPD: most probable derivation
- MPSD: most probable shortest derivation
- SL-DOP: shortest derivation among n most probable parse trees (n=7)
- SL-DOP-simple: shortest derivation among derivations of n most probable parse trees (n=7; approximation)
Estimators:
- RFE: Relative Frequency Estimate
- EWE: Equal Weights Estimate
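The difference between MPP, MPD, and MPSD, and the way RFE assigns fragment weights, can be illustrated with a small sketch. It rests on several assumptions: fragments are represented as hypothetical `(root_label, fragment_id)` pairs, a derivation as a `(tree, fragments)` pair, and a derivation's probability as the product of its fragment weights. MPSD is read here as "the fewest fragments, with ties broken by probability". All function names are made up for this example and are not the disco-dop API. EWE is not shown.

```python
from collections import Counter, defaultdict
from math import prod  # Python 3.8+


def rfe_weights(fragment_counts):
    """Relative frequency estimate: the count of a fragment divided by the
    total count of all fragments with the same root label."""
    root_totals = Counter()
    for (root, _), count in fragment_counts.items():
        root_totals[root] += count
    return {frag: count / root_totals[frag[0]]
            for frag, count in fragment_counts.items()}


def derivation_prob(fragments, weights):
    """Probability of a derivation: product of its fragment weights."""
    return prod(weights[frag] for frag in fragments)


def most_probable_derivation(derivations, weights):
    """MPD: the single derivation with the highest probability."""
    return max(derivations, key=lambda d: derivation_prob(d[1], weights))


def most_probable_parse(derivations, weights):
    """MPP, approximated from a list of derivations: sum the probabilities
    of all derivations that yield the same tree, then take the best tree."""
    tree_probs = defaultdict(float)
    for tree, fragments in derivations:
        tree_probs[tree] += derivation_prob(fragments, weights)
    return max(tree_probs, key=tree_probs.get)


def most_probable_shortest_derivation(derivations, weights):
    """MPSD (as assumed here): fewest fragments, ties broken by probability."""
    return min(derivations,
               key=lambda d: (len(d[1]), -derivation_prob(d[1], weights)))


# Toy data, invented for illustration only.
counts = {("S", "s1"): 3, ("S", "s2"): 1, ("NP", "np1"): 2}
weights = rfe_weights(counts)

# Hand-set weights chosen so that MPD and MPP disagree.
toy_weights = {("S", "x1"): 0.4, ("S", "x2"): 0.3, ("S", "x3"): 0.3}
toy_derivations = [
    ("T1", [("S", "x1")]),
    ("T2", [("S", "x2")]),
    ("T2", [("S", "x3")]),
]
mpd_tree = most_probable_derivation(toy_derivations, toy_weights)[0]
mpp_tree = most_probable_parse(toy_derivations, toy_weights)
```

In the toy data, tree T1 has the single most probable derivation, while T2 collects more probability mass over its two derivations, so the two criteria select different trees.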
The source code is available at http://github.com/andreasvc/disco-dop/ and documented at http://andreasvc.github.io/discodop/
References:
- English, German, and Dutch parser: Andreas van Cranenburgh, Remko Scha, Rens Bod (2016). Data-Oriented Parsing with Discontinuous Constituents and Function Tags. Journal of Language Modelling, vol. 4, no. 1, pp. 57-111. http://dx.doi.org/10.15398/jlm.v4i1.100
- French parser: Federico Sangati, Andreas van Cranenburgh (2015). Multiword Expression Identification with Recurring Tree Fragments and Association Measures. Proceedings of the 11th Workshop on Multiword Expressions, pp. 10-18. http://aclanthology.org/W15-0902