Saturday, 21 June 2014

Using WordNet 3.1 RDF for Indonesian NLP Parsing

The WordNet 3.1 RDF dataset lists 155287 synsets (senses) that can be used for word recognition in natural language parsing, here for Indonesian.


Alhamdulillah, WordNet 3.1 RDF also provides wordnet-ontology:translation values in Indonesian for some of the synsets, which makes it quite usable.

relex-id now uses the WordNet 3.1 RDF dataset for word recognition that takes part-of-speech into account, and can recognize 49,333 Indonesian verbs. :)))
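The lookup described above can be sketched roughly as follows. This is only an illustration in Python, not relex-id's actual code: `TRANSLATIONS`, `tokenize` and `recognize` are hypothetical names, and the table is filled in by hand with the sense IDs that appear in the log at the end of this post rather than loaded from the dataset.

```python
import re

# Hypothetical in-memory index: (Indonesian word, part of speech) -> sense QNames.
# A real index would be loaded from the WordNet 3.1 RDF dataset; these entries
# are copied from the log at the end of this post.
TRANSLATIONS = {
    ("suka", "verb"): [
        "wn31:200675902-v", "wn31:201779085-v", "wn31:201780389-v",
        "wn31:201781131-v", "wn31:201780873-v", "wn31:201832678-v",
        "wn31:201828678-v",
    ],
    ("gajah", "noun"): ["wn31:102506148-n"],
}


def tokenize(sentence):
    """Split into words, whitespace runs and punctuation, keeping all of them."""
    return re.findall(r"\w+|\s+|[^\w\s]", sentence)


def recognize(token, part_of_speech):
    """Return (chosen sense, all candidate senses), or (None, []) if unknown."""
    senses = TRANSLATIONS.get((token.lower(), part_of_speech), [])
    # Naive disambiguation: take the first candidate, which is what the
    # WARN lines in the log report.
    return (senses[0] if senses else None), senses


print(tokenize("Aku suka gajah."))
chosen, candidates = recognize("suka", "verb")
print(chosen, "out of", len(candidates), "candidate senses")
```

Keying the index on the part of speech as well as the word keeps a verb lookup from returning noun senses of the same word.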

Example output for the sentence "Aku suka gajah.":

Sentence structure:
(S (PP i) (VP wn31:200675902-v (NP wn31:102506148-n)) . )


Sentence in English:

I like elephant.

Sentence in Indonesian:
Aku suka gajah.
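The bracketed sentence structure above is a nested tree, so it can be read back into a data structure with a few lines of code. The following is a minimal sketch, assuming the notation is just parentheses plus whitespace-separated symbols; `parse_structure` is a hypothetical helper, not part of relex-id.

```python
import re


def parse_structure(text):
    """Parse a bracketed structure into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            leaf = tokens[pos]  # a bare symbol such as "i" or "."
            pos += 1
            return leaf
        pos += 1  # skip "("
        children = []
        while tokens[pos] != ")":
            children.append(node())
        pos += 1  # skip ")"
        return children

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing input after the root node")
    return tree


tree = parse_structure("(S (PP i) (VP wn31:200675902-v (NP wn31:102506148-n)) . )")
print(tree)
```

Each list starts with its phrase label (S, PP, VP, NP), followed by its head word and any nested phrases.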

The QNames shown are less readable than those in the DBpedia namespace. Their advantage, however, is that each sense QName can be dereferenced directly to WordNet:

  1. suka (v) : http://wordnet-rdf.princeton.edu/wn31/200675902-v
  2. gajah (n) : http://wordnet-rdf.princeton.edu/wn31/102506148-n
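Going from a QName to the URLs listed above is a simple prefix expansion. A minimal sketch, assuming `wn31` is bound to the namespace shown in those URLs; `PREFIXES` and `expand_qname` are names made up for this example:

```python
# Prefix table; the wn31 namespace is taken from the URLs listed above.
PREFIXES = {"wn31": "http://wordnet-rdf.princeton.edu/wn31/"}


def expand_qname(qname, prefixes=PREFIXES):
    """Expand 'prefix:local' into a full URI using the prefix table."""
    prefix, sep, local = qname.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {qname!r}")
    return prefixes[prefix] + local


for qname in ("wn31:200675902-v", "wn31:102506148-n"):
    print(qname, "->", expand_qname(qname))
```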

This certainly makes linguistic analysis easier, as well as rendering words or sentences in other languages such as Japanese, Arabic, and so on (although without taking grammar into account).
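Rendering in another language without grammar amounts to swapping each sense for a label in the target language, word by word. The sketch below assumes a hypothetical `LABELS` table keyed by sense QName; only the English and Indonesian words from the example in this post are filled in, and the pronoun gets its own table because the log shows 'Aku' being matched as a literal rather than through WordNet.

```python
# Hypothetical per-language labels keyed by sense QName. The entries come from
# the example in this post; other languages would be added the same way.
LABELS = {
    "wn31:200675902-v": {"en": "like", "id": "suka"},
    "wn31:102506148-n": {"en": "elephant", "id": "gajah"},
}
# Pronouns are kept in a separate, equally hypothetical table.
PRONOUNS = {"i": {"en": "I", "id": "Aku"}}
PUNCTUATION = {".", ",", "!", "?"}


def render(leaves, lang):
    """Join the leaves of a parsed sentence using labels for the given language."""
    out = ""
    for leaf in leaves:
        if leaf in LABELS:
            word = LABELS[leaf][lang]
        elif leaf in PRONOUNS:
            word = PRONOUNS[leaf][lang]
        else:
            word = leaf  # punctuation or an unknown token is passed through
        if out and word not in PUNCTUATION:
            out += " "
        out += word
    return out


leaves = ["i", "wn31:200675902-v", "wn31:102506148-n", "."]
print(render(leaves, "en"))  # I like elephant.
print(render(leaves, "id"))  # Aku suka gajah.
```

Word order and inflection are left untouched, which is why the English rendering reads "I like elephant." rather than a grammatical sentence.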

As a further improvement, the output structure could be made more readable by using the rdfs:label of each sense, although this naturally adds overhead in both performance and memory usage.
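One way to limit that overhead is to look labels up lazily and keep only a bounded number in memory. The following is a sketch under stated assumptions: `_LABEL_STORE` stands in for the rdfs:label triples in the dataset, and its two values are simply the English words from the example output, not values read from WordNet.

```python
import re
from functools import lru_cache

# Stand-in for the rdfs:label triples; a real lookup would query the dataset.
_LABEL_STORE = {
    "wn31:200675902-v": "like",
    "wn31:102506148-n": "elephant",
}

# Matches sense QNames of the form used in this post, e.g. wn31:200675902-v.
QNAME = re.compile(r"wn31:\d+-[a-z]")


@lru_cache(maxsize=1024)  # bounded cache: trades a little memory for fewer lookups
def label_for(qname):
    """Return the label for a sense, falling back to the QName itself."""
    return _LABEL_STORE.get(qname, qname)


def readable(structure):
    """Replace every sense QName in a bracketed structure with its label."""
    return QNAME.sub(lambda match: label_for(match.group(0)), structure)


print(readable("(S (PP i) (VP wn31:200675902-v (NP wn31:102506148-n)) . )"))
# (S (PP i) (VP like (NP elephant)) . )
```

The `maxsize` value is arbitrary here; it is the knob that sets how much memory is spent to avoid repeated lookups.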

Log:

10:40:18.016 [main] INFO  id.ac.itb.ee.lskk.relexid.core.RelEx - Initializing WordNet 3.1 TDB database at /home/ceefour/wn31_tdb
10:40:18.279 [main] INFO  id.ac.itb.ee.lskk.relexid.core.RelEx - Loading verb translations...
10:40:20.189 [main] INFO  id.ac.itb.ee.lskk.relexid.core.RelEx - Loaded 49333 verb translations
10:40:20.190 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Loading LexRules from class id.ac.itb.ee.lskk.relexid.core.RelExParserTest > lumen.LexRules.xmi
10:40:20.495 [main] INFO  o.soluvas.commons.OnDemandXmiLoader - Loading XMI: lumen.LexRules.xmi from id.ac.itb.ee.lskk.relexid.core.RelExParserTest
10:40:20.560 [main] INFO  o.soluvas.commons.OnDemandXmiLoader - Loaded id.ac.itb.ee.lskk.relexid.core.impl.LexRulesImpl object from file:/home/ceefour/git/relex-id/core/target/classes/id/ac/itb/ee/lskk/relexid/core/lumen.LexRules.xmi
10:40:20.561 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Loading RelationRules from class id.ac.itb.ee.lskk.relexid.core.RelExParserTest > lumen.RelationRules.xmi
10:40:20.561 [main] INFO  o.soluvas.commons.OnDemandXmiLoader - Loading XMI: lumen.RelationRules.xmi from id.ac.itb.ee.lskk.relexid.core.RelExParserTest
10:40:20.565 [main] INFO  o.soluvas.commons.OnDemandXmiLoader - Loaded id.ac.itb.ee.lskk.relexid.core.impl.RelationRulesImpl object from file:/home/ceefour/git/relex-id/core/target/classes/id/ac/itb/ee/lskk/relexid/core/lumen.RelationRules.xmi
10:40:20.566 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Tokens: [Aku,  , suka,  , gajah, .]
10:40:20.573 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Element id.ac.itb.ee.lskk.relexid.core.impl.LiteralMatcherImpl@38f1cd74 (literals: [aku], caseSensitive: false) match for #0: 'Aku'
10:40:20.576 [main] WARN  id.ac.itb.ee.lskk.relexid.core.RelEx - PartOfSpeech matcher for verb 'suka' chose the first sense {http://wordnet-rdf.princeton.edu/wn31/}200675902-v but matched 7 senses: [wn31:200675902-v, wn31:201779085-v, wn31:201780389-v, wn31:201781131-v, wn31:201780873-v, wn31:201832678-v, wn31:201828678-v]
10:40:20.576 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Element id.ac.itb.ee.lskk.relexid.core.impl.PartOfSpeechMatcherImpl@57ab648b (partOfSpeech: verb) match for #2: 'suka'
10:40:20.576 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Element id.ac.itb.ee.lskk.relexid.core.impl.LiteralMatcherImpl@35560ea4 (literals: [kamu], caseSensitive: false) not match for #4: 'gajah'
10:40:20.576 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Rule id.ac.itb.ee.lskk.relexid.core.impl.LexRuleImpl@24db4c57 not match for [Aku,  , suka,  , gajah, .]
10:40:20.576 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Element id.ac.itb.ee.lskk.relexid.core.impl.LiteralMatcherImpl@3970f6a8 (literals: [aku], caseSensitive: false) match for #0: 'Aku'
10:40:20.576 [main] WARN  id.ac.itb.ee.lskk.relexid.core.RelEx - PartOfSpeech matcher for verb 'suka' chose the first sense {http://wordnet-rdf.princeton.edu/wn31/}200675902-v but matched 7 senses: [wn31:200675902-v, wn31:201779085-v, wn31:201780389-v, wn31:201781131-v, wn31:201780873-v, wn31:201832678-v, wn31:201828678-v]
10:40:20.577 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Element id.ac.itb.ee.lskk.relexid.core.impl.PartOfSpeechMatcherImpl@2751ad0e (partOfSpeech: verb) match for #2: 'suka'
10:40:20.577 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Element id.ac.itb.ee.lskk.relexid.core.impl.ResourceMatcherImpl@6338864c (resource: wn31:102506148-n) match for #4: 'gajah'
10:40:20.587 [main] INFO  id.ac.itb.ee.lskk.relexid.core.RelEx - Rule id.ac.itb.ee.lskk.relexid.core.impl.LexRuleImpl@632b5ac2 match for [0‥4]: ['Aku', ' ', 'suka', ' ', 'gajah']
10:40:20.590 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - capturingGroup[2] = {http://wordnet-rdf.princeton.edu/wn31/}200675902-v
10:40:20.591 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - capturingGroup[3] = {http://wordnet-rdf.princeton.edu/wn31/}102506148-n
10:40:20.591 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Replacing with 2 parts at index #0: [(PP i), (VP wn31:200675902-v (NP wn31:102506148-n))]
10:40:20.593 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Element id.ac.itb.ee.lskk.relexid.core.impl.LiteralMatcherImpl@7233d91f (literals: [.], caseSensitive: false) match for #2: '.'
10:40:20.594 [main] INFO  id.ac.itb.ee.lskk.relexid.core.RelEx - Rule id.ac.itb.ee.lskk.relexid.core.impl.LexRuleImpl@59e6d9b0 match for [2‥2]: ['.']
10:40:20.595 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Replacing with 1 parts at index #2: [.]
10:40:20.595 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Relation rule [pronoun verb => _subj(2, 1) || _obj(2, 2/1)] not matches 0..0 [(PP i)]
10:40:20.595 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher pronoun for (PP i): parent.matched?true children.matched?true
10:40:20.595 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher pronoun matches (PP i)
10:40:20.595 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher pronoun for (NP wn31:102506148-n): parent.matched?false
10:40:20.595 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher pronoun not matches (NP wn31:102506148-n)
10:40:20.595 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Matcheds 0 for matchers [pronoun] against [(NP wn31:102506148-n)]
10:40:20.595 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher verb for (VP wn31:200675902-v (NP wn31:102506148-n)): parent.matched?true children.matched?false
10:40:20.595 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher verb not matches (VP wn31:200675902-v (NP wn31:102506148-n))
10:40:20.595 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Matcheds 1 for matchers [pronoun, verb] against [(PP i), (VP wn31:200675902-v (NP wn31:102506148-n))]
10:40:20.595 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Relation rule [pronoun verb => _subj(2, 1) || _obj(2, 2/1)] not matches 0..1 [(PP i), (VP wn31:200675902-v (NP wn31:102506148-n))]
10:40:20.595 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Relation rule [pronoun verb => _subj(2, 1) || _obj(2, 2/1)] not matches 0..2 [(PP i), (VP wn31:200675902-v (NP wn31:102506148-n)), .]
10:40:20.596 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Relation rule [pronoun verb => _subj(2, 1) || _obj(2, 2/1)] not matches 1..1 [(VP wn31:200675902-v (NP wn31:102506148-n))]
10:40:20.596 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher pronoun for (VP wn31:200675902-v (NP wn31:102506148-n)): parent.matched?false
10:40:20.596 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher pronoun not matches (VP wn31:200675902-v (NP wn31:102506148-n))
10:40:20.596 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Matcheds 0 for matchers [pronoun, verb] against [(VP wn31:200675902-v (NP wn31:102506148-n)), .]
10:40:20.596 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Relation rule [pronoun verb => _subj(2, 1) || _obj(2, 2/1)] not matches 1..2 [(VP wn31:200675902-v (NP wn31:102506148-n)), .]
10:40:20.596 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Relation rule [pronoun verb => _subj(2, 1) || _obj(2, 2/1)] not matches 2..2 [.]
10:40:20.596 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Relation rule [pronoun verb => _subj(2, 1) || _obj(2, 2/1)] not matches 0..0 [(PP i)]
10:40:20.596 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher pronoun for (PP i): parent.matched?true children.matched?true
10:40:20.596 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher pronoun matches (PP i)
10:40:20.596 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher noun for (NP wn31:102506148-n): parent.matched?true children.matched?true
10:40:20.596 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher noun matches (NP wn31:102506148-n)
10:40:20.596 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Matcheds 1 for matchers [noun] against [(NP wn31:102506148-n)]
10:40:20.596 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher verb for (VP wn31:200675902-v (NP wn31:102506148-n)): parent.matched?true children.matched?true
10:40:20.596 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher verb matches (VP wn31:200675902-v (NP wn31:102506148-n))
10:40:20.596 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Matcheds 2 for matchers [pronoun, verb] against [(PP i), (VP wn31:200675902-v (NP wn31:102506148-n))]
10:40:20.596 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Relation rule [pronoun verb => _subj(2, 1) || _obj(2, 2/1)] matches 0..1 [(PP i), (VP wn31:200675902-v (NP wn31:102506148-n))]
10:40:20.597 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Relation rule [pronoun verb => _subj(2, 1) || _obj(2, 2/1)] not matches 0..2 [(PP i), (VP wn31:200675902-v (NP wn31:102506148-n)), .]
10:40:20.597 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Relation rule [pronoun verb => _subj(2, 1) || _obj(2, 2/1)] not matches 1..1 [(VP wn31:200675902-v (NP wn31:102506148-n))]
10:40:20.597 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher pronoun for (VP wn31:200675902-v (NP wn31:102506148-n)): parent.matched?false
10:40:20.597 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Matcher pronoun not matches (VP wn31:200675902-v (NP wn31:102506148-n))
10:40:20.597 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Matcheds 0 for matchers [pronoun, verb] against [(VP wn31:200675902-v (NP wn31:102506148-n)), .]
10:40:20.598 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Relation rule [pronoun verb => _subj(2, 1) || _obj(2, 2/1)] not matches 1..2 [(VP wn31:200675902-v (NP wn31:102506148-n)), .]
10:40:20.598 [main] TRACE id.ac.itb.ee.lskk.relexid.core.RelEx - Relation rule [pronoun verb => _subj(2, 1) || _obj(2, 2/1)] not matches 2..2 [.]
10:40:20.598 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Deduced 2 relations from 3 parts [(PP i), (VP wn31:200675902-v (NP wn31:102506148-n)), .] >> [_subj(wn31:200675902-v, I), _obj(wn31:200675902-v, wn31:102506148-n)]
10:40:20.598 [main] DEBUG id.ac.itb.ee.lskk.relexid.core.RelEx - Deduced 2 relations for sentence 'null': [_subj(wn31:200675902-v, I), _obj(wn31:200675902-v, wn31:102506148-n)]
10:40:20.598 [main] INFO  i.a.i.e.l.r.core.RelExParserTest - Sentence structure: (S (PP i) (VP wn31:200675902-v (NP wn31:102506148-n)) . )
10:40:20.598 [main] INFO  i.a.i.e.l.r.core.RelExParserTest - Sentence in English: I like elephant.
10:40:20.598 [main] INFO  i.a.i.e.l.r.core.RelExParserTest - Sentence in Indonesian: Aku suka gajah.
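The last part of the log shows the relation rule `pronoun verb => _subj(2, 1) || _obj(2, 2/1)` producing two relations. The sketch below is one reading of that rule, not relex-id's implementation: it assumes that `1` and `2` refer to the matched parts and that `2/1` refers to the first child of part 2. The helper names are hypothetical, and the pronoun is left as `i` instead of being rewritten to `I` as in the log.

```python
def head(part):
    """Head word of a part: ['VP', head, child, ...] -> head."""
    return part[1]


def is_pronoun(part):
    return isinstance(part, list) and part[0] == "PP"


def is_verb_with_noun(part):
    """A VP whose first child is an NP, as in the example sentence."""
    return (
        isinstance(part, list)
        and part[0] == "VP"
        and len(part) > 2
        and isinstance(part[2], list)
        and part[2][0] == "NP"
    )


def deduce(parts):
    """Apply 'pronoun verb => _subj(2, 1) || _obj(2, 2/1)' to adjacent parts."""
    relations = []
    for left, right in zip(parts, parts[1:]):
        if is_pronoun(left) and is_verb_with_noun(right):
            verb = head(right)
            relations.append(("_subj", verb, head(left)))      # _subj(2, 1)
            relations.append(("_obj", verb, head(right[2])))   # _obj(2, 2/1)
    return relations


parts = [
    ["PP", "i"],
    ["VP", "wn31:200675902-v", ["NP", "wn31:102506148-n"]],
    ".",
]
for name, governor, dependent in deduce(parts):
    print(f"{name}({governor}, {dependent})")
```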