
Monday, July 14, 2014

Probabilistic Programming with Church

Church is a probabilistic programming language with syntax resembling Scheme/Common Lisp, and it has a JavaScript implementation, Webchurch, that runs in a web browser.

With Church, sampling probabilistic data becomes much easier, as does representing the characteristics of a probability distribution as a procedure.

Here is an example Webchurch program I wrote to sample two weighted variables:

(define a (lambda () (flip 0.8)))
(define b (lambda (a) (if a (flip) (flip 0.3))))
(define ab (lambda ()
             (define A (a))
             (list A (b A))))
(hist (repeat 1000 ab) "P(A, B)")

The result:


Yee :)

Interesting... will this kind of probabilistic programming language be useful for implementing semantic reasoning or PLN? I don't know yet...

OpenCog REST API Interactive Documentation powered by Swagger

OpenCog has a flask-restful powered REST API which covers the most useful OpenCog/AtomSpace operations, although it currently has only a few endpoints.

I contributed an interactive REST API documentation feature using flask-restful-swagger, which is based on the Swagger specification.


In that merged pull request you can also see what the API documentation looks like. Pretty nice, eh? :-)

Tuesday, July 8, 2014

Designing Software Around Data Grid and Compute Grid Programming Model

OpenCog discussion of the impact of thread-safety and distributed processing on performance (and code structure):

On Sunday, July 6, 2014 5:51:36 PM UTC-4:30, linas wrote:
On 6 July 2014 11:51, Ramin Barati <rek...@gmail.com> wrote:

Anyway, a few years ago, the AtomSpace used the proxy pattern and it was a performance disaster. I removed it, which was a lot of hard, painful work, because it should never have been added in the first place.  Atomspace addnode/link operations got 3x faster, getting and setting truth values got 30x faster, getting outgoing sets got 30x faster.  You can read about it in opencog/benchmark/diary.txt

I can't even imagine what use the Atomspace has for the proxy pattern.

Ostensibly thread-safety, and distributed processing. Thread-safety, because everything was an "atom space request" that went through a single serialized choke point.  Distributed processing, because you could now insert a zeromq into that choke point, and run it on the network.  The zmq stuff was even prototyped. When measured, it did a few hundred atoms per second, so work stopped.

To me, it was an example of someone getting an idea, but failing to think it through before starting to code.  And once the code was written, it became almost impossible to admit that it was a failure.  Because that involves egos and emotions and feelings.

--linas


I think designing most parts of the software around data grid and compute grid constructs would allow:
  1. intuitive coding. i.e. no special constructs or API, just closures and local data; closures stay cheap even across many iterations, with no thread switching. e.g.
    IntStream.range(0, 100_000).map(i -> calculateHeavily(i)).sum()

    A nice side effect is that a new member or hire can join the project without having to learn the intricacies of messaging & multithreading plumbing; there's already too much logic to learn anyway without adding that glue.
  2. practical multithreading. I'm tempted to say painless multithreading :) Multithreading becomes configuration, not baked-in logic that everyone is "afraid" to change. While making single-threaded code multithreaded takes time, it takes even more time to make it work correctly, with no race conditions and no negative scaling; and when all else fails you revert to the original code while other development continues concurrently.
  3. messaging. Even the simple code above implies messaging: logically it sends the int i into the closure. At runtime this can be single-threaded, multithreaded, or multi-node. If calculateHeavily takes nanoseconds, multithreading is pointless; if it takes more than a second, multi-node is probably worthwhile.
  4. data affinity. The data passed to the closure doesn't have to be the "data"/"content" itself; it can be a key, which the closure then loads locally, processes, and aggregates.
    findAtomsWhichAre(Person).map(personId -> calculateHeavily(get(personId))).collect(stats)

    I haven't seen the ZeroMQ implementation of AtomSpace, but I suspect it is (was?) a chatty protocol, and it would have looked different had data affinity been considered. I've only ever used AMQP, but I think both ZeroMQ and AMQP are better suited to implementing a messaging protocol than to scaling, unless explicitly designed for it, as in Apache Storm's use of ZeroMQ.
  5. caching... consistently & deterministically. One way to reduce chattiness is to cache data you already have for reuse on later invocations (hopefully many, otherwise heavy cache misses make things worse). The problem is that caches go stale. The solution is the data grid model, in which the cache is consistent at all times. While this couples the cache and the database a bit more tightly, the programming model stays the same (as in point #1) and no stale-cache checks are needed.
  6. caching... transactionally. Manually implementing atomicity is hard with messaging/RPC/remote proxies, or even with basic multithreading. A data grid framework allows transactional caching, so
    ids.forEach(id -> { a = get(id); process(a); put(id, a); })
    would not step on other concurrent operations.
  7. performance testing across several configurations. If a project has performance unit tests, these can run in multiple configs: 1 thread; n threads; 2 nodes × n threads.
    This ideally gives instant gratification: if a performance unit test shows negative scaling, you notice it early; and if it approaches linear scaling, congrats & have a beer :D
  8. bulk reads & writes. Related to #5: if there are many scattered writes to the database, a write-through cache improves this while maintaining transactional behavior. Instead of 100 writes of 1 document each, the cache can bulk-write 1 database request of 100 documents. You may let the framework do it, or code bulk writes manually in certain cases; the choice is there.
  9. bulk messaging. Related to #3 and #4. A straightforward messaging protocol may turn 100 operations into 50 messages to node1 and 50 messages to node2, which can be significant overhead. A compute grid can turn those 100 operations into 2 messages: 1 message of 50 operations to node1 and 1 message of 50 operations to node2.
  10. avoiding premature optimization, while allowing both parallelization and optimization. Related to #7: since you know you can change the config at any time, if you notice negative scaling you can simply set that particular cache or computation to single-thread or single-node. Implementing #2 + #3 manually sometimes actually hinders optimization.
While my reasons above pertain to performance, I'm not suggesting you think about performance while coding; I do suggest a programming model that lets you apply performance tweaks unobtrusively, even experimentally, while retaining sanity. (That last part seems to be important at times. :)
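To make point #2 concrete, here is a minimal, self-contained sketch (the names ParallelismAsConfig and calculateHeavily are mine, not from any real project): the same logic runs single- or multi-threaded depending on a flag, so threading really is configuration rather than baked-in plumbing.

```java
import java.util.stream.IntStream;

public class ParallelismAsConfig {
    // hypothetical heavy per-item computation
    static long calculateHeavily(int i) {
        return (long) i * i;
    }

    // identical logic either way; only the configuration flag differs
    static long run(boolean parallel) {
        IntStream range = IntStream.range(0, 100_000);
        if (parallel) range = range.parallel();
        return range.mapToLong(ParallelismAsConfig::calculateHeavily).sum();
    }

    public static void main(String[] args) {
        // e.g. java -DuseParallel=true ParallelismAsConfig
        System.out.println(run(Boolean.getBoolean("useParallel")));
    }
}
```

Because the sequential and parallel paths share one code body, reverting a negatively-scaling configuration is a one-flag change, not a rewrite.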

For example, using GridGain one uses closures, Java 8 lambdas, or annotated methods to perform computation, so things aren't that much different. To access data one uses a get/put/remove/queries abstraction, which you'd usually abstract anyway, but now you have the option of making these read-through and write-through caches instead of going direct to the database. Data may be served from a remote/local database, a remote grid node, or local memory; this is abstracted. The choice is there, but it doesn't have to clutter the code, and can either be implemented separately in a different class (Strategy pattern) or configured declaratively.

GridGain is easy to use from Java; I believe modern languages like Python or Scheme also have ways to achieve a similar programming model, though for C++ I probably can't say much.

Personally I'd love for a project to evolve instead of being rewritten, i.e. refactoring different parts over successive versions while retaining the general logic and the other parts. This not only preserves code (and source history) but, more importantly, the team's shared knowledge. In a rewrite it's hard not to repeat the same mistakes, not to mention the second-system effect.

In my experience with Bippo eCommerce, our last complete rewrite was in 2011, when we switched from PHP to Java. From then to the present we made several major architectural changes, as well as framework changes: JSF to quasi-JavaScript to Wicket, Java EE to OSGi to Spring, Java6 to Java7 to Java8, LDAP to MongoDB, MySQL to MongoDB to MongoDB + PostgreSQL, and so on... sure, we had our share of mistakes, but the valuable part is that we never rewrote the entire codebase at once; we deprecated and removed parts of the codebase as we went along. And the team retains collective knowledge of the process, i.e. the dependencies between one library and another, and which part breaks when we change one architecture. I find that very beneficial.

Performance Unit Testing to Reduce Friction in Software Evolution

Dr. Linas Vepstas' concern regarding premature distribution architecture in OpenCog AtomSpace that hinders performance optimization:

Anyway, a few years ago, the AtomSpace used the proxy pattern and it was a performance disaster. I removed it, which was a lot of hard, painful work, because it should never have been added in the first place.  Atomspace addnode/link operations got 3x faster, getting and setting truth values got 30x faster, getting outgoing sets got 30x faster.  You can read about it in opencog/benchmark/diary.txt

--linas


With regard to performance, I think it's probably useful for each project to have performance unit tests: just like unit tests, but timing iterations of operations over known data instead of checking correctness.
It doesn't have to be a lot of data, just a small curated set of input data that's somewhat representative.
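A minimal sketch of what such a performance unit test could look like (PerfUnitTest, addNodes, and opsPerSecond are illustrative names, not from OpenCog): the "assertion" is a recorded timing over a fixed workload, which a CI job could log per commit.

```java
import java.util.ArrayList;
import java.util.List;

public class PerfUnitTest {
    // stand-in for an operation under test, driven by a small curated input
    static int addNodes(int n) {
        List<Integer> atoms = new ArrayList<>();
        for (int i = 0; i < n; i++) atoms.add(i);
        return atoms.size();
    }

    // like a unit test, but measuring throughput rather than correctness
    static double opsPerSecond(int n) {
        long start = System.nanoTime();
        addNodes(n);
        double seconds = (System.nanoTime() - start) / 1e9;
        return n / seconds;
    }

    public static void main(String[] args) {
        System.out.printf("addNode: %.0f ops/sec%n", opsPerSecond(100_000));
    }
}
```

A real suite would warm up the JIT and repeat runs, but even this crude form gives the per-commit history of simple metrics described above.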


This way development (and development branches) gains a history of simple performance metrics, collected on a regular basis.

The intention is not to encourage premature optimization, but to give an early warning when a code restructuring/refactoring might impact performance negatively.
With that out of the way, the developer is freer to explore software design options and test feature branches, so that internal structures & implementations stay "sane".

(This implies that the tests themselves have a maintenance cost, which can be alleviated by having a relatively stable public API the tests can rely on.)

Natural Language Processing - Note from Dr. Linas Vepstas

The process should be computationally feasible.  We are writing code for it. However, there's an endless list of practicalities, everything from a labor shortage to buggy code, missing infrastructure, poorly expressed and misunderstood ideas, as well as plenty of open questions and research to be done.

Practically, at this time, the big road-blocks are:

1. not having a fully-functional PLN, and
2. not having a large database of common-sense experience/knowledge.

Monday, July 7, 2014

Distributing and Parallelizing Probabilistic Logic Networks Reasoning

During discussion about making Probabilistic Logic Networks (PLN) allow parallel reasoning using distributed AtomSpace, Dr. Ben Goertzel noted:
As for parallelism, I believe that logic chaining can straightforwardly be made parallel, though this involves changes to the algorithms and their behavior as heuristics.   For example, suppose one is backward chaining and wishes to apply deduction to obtain A --> C.   One can then evaluate multiple B potentially serving the role here, e.g.

A --> B1, B1 --> C  |-  A -->C
A --> B2, B2 --> C  |-  A -->C
...

Potentially, each Bi could be explored in parallel, right?    Also, in exploring each of these, the two terms could be backward chained on in parallel, so that e.g.

A --> B1

and

B1 --> C

could be explored in parallel...

In this way the degree of parallelism exploited by the backward chainer would expand exponentially during the course of a single chaining exploration, until reaching the natural limit imposed by the infrastructure.

This will yield behavior that is conceptually similar, though not identical, to serial backward chaining.
I haven't learned much about PLN yet, but I hope it can be made to work by distributing computation across AtomSpace nodes.

Beyond strict parallelism, each path can be assigned a priority or heuristic, making the tasks a distributed priority queue. A task that finishes early can insert more tasks into the priority queue, and these new tasks don't have to go at the very end; they can land in the middle, etc.
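A single-node sketch of that idea, using the JDK's PriorityBlockingQueue (the ChainTask class and its priority scores are hypothetical, just to illustrate the ordering; a distributed version would replace the local queue with a grid-backed one):

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

public class PlnTaskQueue {
    // hypothetical backward-chaining step with a heuristic score; higher runs first
    static class ChainTask {
        final String target;
        final double priority;
        ChainTask(String target, double priority) {
            this.target = target;
            this.priority = priority;
        }
    }

    // ordered so that the highest-priority task is polled first
    static final PriorityBlockingQueue<ChainTask> queue = new PriorityBlockingQueue<>(
            16, Comparator.comparingDouble((ChainTask t) -> -t.priority));

    public static void main(String[] args) {
        queue.add(new ChainTask("A->B2", 0.4));
        queue.add(new ChainTask("A->B1", 0.9));
        // a worker finishing early can enqueue follow-ups that jump the line
        queue.add(new ChainTask("B1->C", 0.7));
        while (!queue.isEmpty()) {
            System.out.println(queue.poll().target);
        }
    }
}
```

Workers polling this queue naturally explore the most promising paths first, which is exactly the "insert in the middle" behavior described above.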


If we'd like to explore 1,000 paths, and each of those generates another 1,000 paths, ideally we'd have 1 million nodes: the first phase executed by 1,000 nodes in parallel, the next 1 million paths by 1 million nodes in parallel, and we'd have a result in 2 ms. :) In practice we can probably only afford a few nodes and limited time; can PLN use heuristic discovery?

For example, if the AI participates in "Are You Smarter than a 5th Grader", the discovery paths would differ from "calculate the best company strategy, I'll give you two weeks and a detailed report". In a quiz, the AI needs to come up with a vague answer quickly, then refine it progressively until time runs out. I.e. when asked for 10 outputs, the quiz AI will try to get 10 answers as soon as possible even if many of them are incorrect, while the business AI will try to get 1 answer correct, even if the other 9 are left unanswered.

Does PLN do this? If so, the distributed AtomSpace architecture would evolve hand-in-hand with (distributed) PLN. An app or module shouldn't be required to be distributed to use AtomSpace; however, a module (like PLN) that's aware that AtomSpace is both a distributed data grid and a distributed compute grid can take advantage of this architecture and make its operations much faster and more scalable. It's akin to the difference between rendering 3D scenes on the CPU versus with OpenGL-accelerated graphics. However, a computer usually has only one or two graphics cards, fixed in place, whereas an AtomSpace cluster can have a dynamic number of nodes, and you can throw more at it at any time; i.e. for an expensive computation you can launch 100 EC2 instances for several hours, then turn them off when done.

Adding Distributed Indexes to Hypergraph Database for Horizontal Scaling of Semantic Reasoning

While discussing distributed AtomSpace architecture in OpenCog group, Dr. Linas Vepstas noted:
Reference resolution, reasoning and induction might be fairly local as well:  when reading and trying to understand a wikipedia article, it seems as if its related to a million different things.   A single CPU with 16GB RAM can hold 100 million atoms in RAM, requiring no disk or network access.   

The only reason for a database then becomes as a place to store, over long periods of time, the results of the computation.  Its quite possible that fast performance of the database won't actually be important.  Which would mean that the actual database architecture might not be very important.  Maybe.
Based on my experiments, while processing (i.e. reasoning over) 200,000 "atoms" in 3 seconds on a single host isn't too bad, searching for a few atoms out of 200,000 (or even 1 billion) on a single host should be very fast (i.e. ~1 ms or less).

So I guess these are two distinct tasks. Searching would use (distributed) indexing, while processing/reasoning can be done by MindAgents combining data-to-compute and compute-to-data, with consideration to data affinity.

For processing that requires the non-local data Dr. Vepstas was concerned about: when using a compute+data grid such as GridGain, a compute grid node is automatically a cache, so all required non-local data is cached automatically. This may or may not be sufficient, depending on the algorithm.


For searches, it seems we need to create a separate index for each purpose, with each index sharded/partitioned appropriately to distribute compute load. This means the AtomSpace data grid will have redundancy in many ways. The AtomSpace can probably be "split" into 3 parts:
  1. the hypergraph part (can be stored in HyperGraphDB or Neo4j)
  2. the eager index parts, always generated for the entire hypergraph, required for searches (can be stored in Cassandra or Solr or ElasticSearch)
  3. the lazy index parts, the entries are calculated on demand then stored for later usage (can be stored in Cassandra or Solr or ElasticSearch)
The hypergraph is good when you already know the handles, and for traversal. But when the task is "which handles A are B of the handles C, assuming D is E?", an index is needed to answer this (particular task) quickly: hopefully ~1 ms on each grid node, so 100 nodes working in parallel generate 100 sets of answers in ~1 ms.

Today, a node with 16 GB RAM and 2 TB SATA storage is probably a typical config (an SSD will also work, but for the sake of the thought experiment a spinning disk poses more performance concerns). The node holds a partition of the distributed AtomSpace, and is expected to answer any search (i.e. "give me the handles of atoms on your node matching criteria X, Y, Z") within 1 ms, and to run processing over selected atoms (i.e. "for handles [A, B, C, ... N] perform this closure") within 1 second.

To achieve these goals:
  1. For quick searches within a partition, all atom data needs to be indexed in multiple ways, one index per purpose.
  2. For quick updates to an index (triggered by updates to data), the index and the data are colocated on the same host to avoid network I/O, although they can live in different stores (i.e. data in HyperGraphDB and index in Cassandra). The partitioning/sharding needs to accommodate this. So of the 2 TB storage, we might allocate perhaps 100 GB to data and 1 TB to indexes.
  3. For quick lookups and updates of a subset of data, the RAM is used by the data grid as a read-through & write-through cache.
  4. For non-local search/update/lookup/processing, the node uses the data grid and caches results locally in RAM, with overflow to disk. We still have ~900 GB of space left to use for this purpose.
  5. For quick processing of a subset of data, local lookups are performed (which should take near-constant time, even with spinning drives) and are much faster when the requested data is already cached. Processing is then done on the CPU or GPGPU (via OpenCL; e.g. the Encog neural network library uses OpenCL to accelerate calculations). Results are then sent back over the network.
For question-answering: given a label (e.g. Ibnu Sina), possible concept types (Person), and optionally discussion contexts (Islam, religion, social, medicine), find the ConceptNodes that have that label and type, and the confidence value for each context. And I want it done in 1 ms. :D

YAGO has 15,372,313 labels (a 1.1 GB dataset) for 10+ million entities; the entire YAGO is 22 GB. Assuming the entities with labels are stored in AtomSpace, selecting the matching labels without an index would take ~150 seconds on a single host and ~50 seconds on 3 nodes (extrapolating my previous results). With indexes this should be ~1 ms.

The first index would give the concepts, given a label and types, with a structure like:

label -> type -> [concept, concept, concept, ...]
         type -> [concept, concept, concept, ...]
         type -> [concept, concept, concept, ...]

The second index would give the confidence, given a concept and contexts, with sample data like:

Ibnu_Sina1 -> { Islam: 0.7, medicine: 0.9, social: 0.3, ... }
Ibnu_Sina2 -> { Islam: 0.1, medicine: 0.3, social: 0.9, ... }
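The two index structures above can be sketched as nested maps in plain Java (the class and field names are illustrative only; a real deployment would put these in Cassandra/Solr/ElasticSearch as described):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class YagoIndexSketch {
    // first index: label -> type -> concept handles
    static final Map<String, Map<String, List<String>>> byLabel = new HashMap<>();
    // second index: concept handle -> context -> confidence
    static final Map<String, Map<String, Double>> confidence = new HashMap<>();

    static void index(String label, String type, String handle,
                      Map<String, Double> contexts) {
        byLabel.computeIfAbsent(label, k -> new HashMap<>())
               .computeIfAbsent(type, k -> new ArrayList<>())
               .add(handle);
        confidence.put(handle, contexts);
    }

    public static void main(String[] args) {
        Map<String, Double> ctx = new HashMap<>();
        ctx.put("Islam", 0.7);
        ctx.put("medicine", 0.9);
        ctx.put("social", 0.3);
        index("Ibnu Sina", "Person", "Ibnu_Sina1", ctx);

        // the question-answering lookup: Person concepts labelled "Ibnu Sina",
        // with confidence in the "medicine" context
        for (String h : byLabel.get("Ibnu Sina").get("Person")) {
            System.out.println(h + " medicine=" + confidence.get(h).get("medicine"));
        }
    }
}
```

Both lookups are hash lookups, which is what makes the ~1 ms target plausible once the indexes exist and are sharded by label.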

Indexes change constantly: for each atom change, multiple indexes must be updated, and the index updates take more resources than updating the atoms themselves, so index updates are asynchronous and eventually consistent. (I guess this also happens in humans: when we learn new information, we don't immediately "understand" it. We know a new fact, but it takes time [or even sleep] to make sense of its implications and correlations.)
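A minimal sketch of asynchronous, eventually consistent index updates (all names hypothetical): the atom write returns immediately, while a background thread applies the index update later, so index reads may briefly lag the atom store.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncIndexUpdate {
    static final ConcurrentMap<String, String> atoms = new ConcurrentHashMap<>();
    static final ConcurrentMap<String, String> labelIndex = new ConcurrentHashMap<>();
    // a single background thread applies index updates in submission order
    static final ExecutorService indexer = Executors.newSingleThreadExecutor();

    // the atom write completes immediately; the index entry appears later,
    // so the index is only eventually consistent with the atom store
    static Future<?> putAtom(String handle, String label) {
        atoms.put(handle, label);
        return indexer.submit(() -> labelIndex.put(label, handle));
    }
}
```

The returned Future is only a hook for tests; normal callers would fire and forget, accepting the consistency lag in exchange for cheap atom writes.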

We should agree on a set of a priori indexes. (As new concepts are learned and OpenCog gets queries that spend too long processing too many atoms, the AI might learn to create new indexes or tune existing ones... although this is probably too meta, and far in the future. :D )

Experimental Performance Test using GridGain for Distributed Natural Language Processing

I did an experimental performance test using GridGain to simulate AtomSpace processing. This is related to discussion in OpenCog group about AtomSpace architecture.

Disclaimer: This is not a benchmark, please don't treat it as such!



First I loaded 212,351 YAGO labels (from MongoDB, though the actual backend doesn't matter here) for resources starting with the letter M:

13:13:43.178 [main] INFO  i.a.i.e.l.l.yago.YagoLabelCacheStore - Loading 212351 labels...
13:13:45.595 [main] DEBUG i.a.i.e.l.l.yago.YagoLabelCacheStore -   [23%] 50000 labels loaded, 162351 more to go...
13:13:47.571 [main] DEBUG i.a.i.e.l.l.yago.YagoLabelCacheStore -   [47%] 100000 labels loaded, 112351 more to go...
13:13:49.139 [main] DEBUG i.a.i.e.l.l.yago.YagoLabelCacheStore -   [70%] 150000 labels loaded, 62351 more to go...
13:13:50.608 [main] DEBUG i.a.i.e.l.l.yago.YagoLabelCacheStore -   [94%] 200000 labels loaded, 12351 more to go...
13:13:50.914 [main] INFO  i.a.i.e.l.l.yago.YagoLabelCacheStore - Loaded 212351 labels...
13:13:50.917 [main] INFO  id.ac.itb.ee.lskk.lumen.yago.Worker - For yagoLabel, I have 100000 primary out of 100000 entries + 112351 swap

To make it somewhat more realistic, grid data per node is capped at 100,000 entries. The cache is configured as partitioned, so with all 3 nodes the entire dataset should be held in memory. I then started two more nodes, and the latest node runs a search for which resource IDs have the label "Muhammad". It's basically a reverse hashmap lookup that could perfectly well be done with an index, but I'm treating the entries as atoms, just for the sake of doing distributed-parallel computation on them.

Collection<Set<String>> founds = labelCache.queries().createScanQuery(null)
        .execute(new GridReducer<Entry<String, String>, Set<String>>() {
            Set<String> ids = new HashSet<>();

            @Override
            public boolean collect(Entry<String, String> e) {
                if (e.getValue().equalsIgnoreCase(upLabel)) {
                    ids.add(e.getKey());
                }
                return true;
            }

            @Override
            public Set<String> reduce() {
                return ids;
            }
        }).get();

The results, using my workstation i5-3570K @ 4x 3.40GHz, 3 nodes at 1 GB heap each:

[13:29:40] GridGain node started OK (id=03a07172)
[13:29:40] Topology snapshot [ver=5, nodes=3, CPUs=4, heap=3.0GB]
13:29:40.043 [main] INFO  i.a.i.e.l.l.yago.YagoLabelLookupCli2 - Finding resource for label 'Muhammad'...
13:29:43.131 [main] INFO  i.a.i.e.l.l.yago.YagoLabelLookupCli2 - Found for Muhammad: [[Muhammad_Khalil_al-Hukaymah, Muhammad_S._Eissa, Muhammad_Musa, Muhammad_Okil_Musalman, Muhammad_Loutfi_Goumah, Muhammad_Sadiq, Muhammad_Salih, Muhammad_Ismail_Agha, Muhammad_Yusuf_Hashmi, Mustafah_Muhammad, Muhammad_Mahbubur_Rahman, Muhammad_Ahmad_Said_Khan_Chhatari, Muhammad_Jamiruddin_Sarkar, Muhammad_Ibrahim_Joyo, Muhammad_bin_Tughluq, Muhammad_Sohail_Anwar_Choudhry, Muhammad_Tariq_Tarar], [Muhammad_Salman, Muhammad_Jailani_Abu_Talib, Muhammad_Qutb], [Muhammad_Ibrahim_Kamel, Muhammad_Amin_Khan_Turani, Muhammad_Ali_Pate, Muhammad_Rafi_Usmani, Muhammad_Faisal, Muhammad, Muhammad_Ilham, Muhammad_Kurd_Ali, Muhammad_Umar, Muhammad_Shahidullah, Muhammad_Anwar_Khan, Muhammad_Saifullah, Muhammad_Saqlain]]
[13:29:43] GridGain node stopped OK [uptime=00:00:03:865]

It searched 212,351 entries in 3,088 ms, using 3 nodes × 4 threads = 12 total threads on a single host, i.e. a rate of ~68,766 entries/second.

To be fair, GridGain prints performance hints (so for a serious benchmark, these should be tuned):

[13:29:40]   ^-- Decrease number of backups (set 'keyBackups' to 0)
[13:29:40]   ^-- Disable fully synchronous writes (set 'writeSynchronizationMode' to PRIMARY_SYNC or FULL_ASYNC)
[13:29:40]   ^-- Enable write-behind to persistent store (set 'writeBehindEnabled' to true)
[13:29:40]   ^-- Disable query index (set 'queryIndexEnabled' to false)
[13:29:40]   ^-- Disable peer class loading (set 'peerClassLoadingEnabled' to false)
[13:29:40]   ^-- Disable grid events (remove 'includeEventTypes' from configuration)

Of course, 12 threads running on a single host isn't optimal, and there are no network saturation effects since all nodes are on the same host.

Given how GridGain works, performance should be (much?) better with 3 actual machines. The key is that the calculation (map/reduce) is done on each node, so the "MindAgent" (node 3) here only does roughly ~33% of the job; the other 2 "AtomSpace" nodes aren't just serving data, they're also processing the data they already hold, so those bits never move across the network.

Since the closure is (Java) code, it's possible to use OpenCL/GPU for certain tasks, which should increase performance for math-intensive processing.

Fault tolerance also works very well: you can kill and rearrange nodes at will, and the grid survives as long as at least 1 node is up.

Distributed Natural Language Parsing using GridGain as Compute and Data Grid

Discussion in OpenCog group about AtomSpace architecture. Dr. Ben Goertzel notes:
Section 5.3 of my distributed AtomSpace design from June 2012 
http://wiki.opencog.org/wikihome/images/e/ea/Distributed_AtomSpace_Design_Sketch_v6.pdf

is titled "Importance Dynamics" and deals with problem of handling STI and LTI (attention) values in a distributed OpenCog system....  It is brief and only gives a general approach, as I figured it would be best to work out the details after the distributed Atomspace was in the detailed design phase.  Recall that document was written after long discussions with you and others. 
I've been experimenting with GridGain, and it seems to tick most if not all of the performance requirements you need; in addition, Neo4j as the persistent graph store would allow intuitive querying and visual exploration of the AtomSpace.


Currently I have 34 rules (imagine this is the number of Atoms). The core code to process them (Java 8):

Collection<GridFuture<MatchedYagoRule>> matchers = Collections2.transform(ruleIds,
        (ruleId) -> grid.compute().affinityCall(cache.name(), ruleId,
            new GridCallable<MatchedYagoRule>() {
                @Override
                public MatchedYagoRule call() throws Exception {
                    final YagoRule rule = cache.get(ruleId);
                    Pattern pattern = Pattern.compile(rule.questionPattern_en, Pattern.CASE_INSENSITIVE);
                    Matcher matcher = pattern.matcher(msg);
                    if (matcher.matches()) {
                        log.info("MATCH {} Processing rule #{} {}", matcher, ruleId, rule.property);
                        return new MatchedYagoRule(rule, matcher.group("subject"));
                    } else {
                        log.info("not match Processing rule #{} {}", ruleId, rule.property);
                        return null;
                    }
                }
            }));

This probably needs explanation for someone unfamiliar with in-memory data grids, but the whole experiment does very sophisticated things with very little code/setup, and it will scale (I could only find this 2010 article for a comparison benchmark, but I'm sure GridGain has improved a lot since then).

How it works: the compute task (triggered by node2) for the 34 rules is distributed across nodes, and across threads (cores) inside each node. For this example I used 2 nodes on the same machine; the output on node2 is:

...
06:18:46.470 [gridgain-#5%pub-null%] INFO  i.a.i.e.l.l.yago.AnswerYagoFactTests - not match Processing rule hasHeight How tall is (?<subject>.+)\?
06:18:46.473 [gridgain-#7%pub-null%] INFO  i.a.i.e.l.l.yago.AnswerYagoFactTests - not match Processing rule hasEconomicGrowth How much is the economic growth of (?<subject>.+)\?
06:18:46.477 [gridgain-#6%pub-null%] INFO  i.a.i.e.l.l.yago.AnswerYagoFactTests - not match Processing rule isMarriedTo Who did (?<subject>.+) marry\?
06:18:46.485 [gridgain-#10%pub-null%] INFO  i.a.i.e.l.l.yago.AnswerYagoFactTests - Found matcher: MatchedYagoRule [rule=YagoRule [property=wasBornIn, questionPattern_en=Where was (?<subject>.+) born\?, questionPattern_id=Di mana (?<subject>.+) dilahirkan\?, answerTemplateHtml_en={{subject}} was born in {{object}}., answerTemplateHtml_id={{subject}} lahir di {{object}}.], subject=Michael Jackson]
06:18:46.485 [gridgain-#10%pub-null%] INFO  i.a.i.e.l.l.yago.AnswerYagoFactTests - Subject: MatchedYagoRule [rule=YagoRule [property=wasBornIn, questionPattern_en=Where was (?<subject>.+) born\?, questionPattern_id=Di mana (?<subject>.+) dilahirkan\?, answerTemplateHtml_en={{subject}} was born in {{object}}., answerTemplateHtml_id={{subject}} lahir di {{object}}.], subject=Michael Jackson]
[06:18:46] GridGain node stopped OK [uptime=00:00:00:989]

However, the match didn't actually happen on node2; it happened on node1:

06:18:46.436 [gridgain-#7%pub-null%] INFO  i.a.i.e.l.l.yago.AnswerYagoFactTests - MATCH java.util.regex.Matcher[pattern=Where was (?<subject>.+) born\? region=0,31 lastmatch=Where was Michael Jackson born?] Processing rule wasBornIn Where was (?<subject>.+) born\?

node1 and node2 hold different partitions of the 34 rules. So node2, as the node that triggers the job, distributes it (literally sending the Java closure bytecode over the network) to other nodes, based on affinity to the requested rule. node1 processes that closure/job over the entries/rules it holds. In my example node2 does the same, since it also holds a partition of the rules, but the matching rule happened to live on node1. All jobs send back their results (map), which are then reduced, and we get the output.

Also, the rules are held in persistent storage, which in my simple case is actually a CSV file; in reality this would be a data store such as Neo4j, PostgreSQL, or Cassandra. This means the maximum AtomSpace capacity equals the sum of the hard drives (depending on the replication factor), and during processing each node's RAM is used to process the data closest to it, based on affinity.
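What the data grid does with RAM here can be illustrated by a tiny read-through cache in plain Java (this is my own illustration, not GridGain's API): entries are served from memory when possible and loaded from the persistent store only on a miss.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// plain-Java illustration of read-through caching: serve from RAM when
// possible, fall back to the persistent store (CSV, Neo4j, PostgreSQL, ...)
// on a miss, and keep the loaded value for next time
public class ReadThroughCache<K, V> {
    private final Map<K, V> ram = new HashMap<>();
    private final Function<K, V> store; // loader backed by the persistent store

    public ReadThroughCache(Function<K, V> store) {
        this.store = store;
    }

    public V get(K key) {
        // load on miss, then cache; repeated reads never touch the store
        return ram.computeIfAbsent(key, store);
    }
}
```

In GridGain this behavior is configured rather than hand-written, but the effect is the same: the processing code keeps calling a plain get() while the framework decides whether the answer comes from RAM or from disk.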

We get several nice properties:
  • distribution/partitioning of data, which means increased storage space
  • distribution of compute, enabled by affinity-based computation, i.e. a node processes requests based on the atoms it already holds
  • with pluggable persistent storage, the "API" so-to-speak for processing atoms stays the same, even if we process 100 GB of (total) atoms with only 4 GB of (total) RAM, since GridGain manages the read-through & write-through behavior via the GridCacheStore implementation
  • GridGain allows (in-memory) indexes on data, which, if used, can complement the datastore's indexes and provide flexible querying (i.e. beyond fetching by key) while retaining performance
  • GridGain is Apache licensed, with commercial support :) companies using OpenCog have the option of GridGain's consulting & commercial support to tune their systems

Neo4j as Graph Database for OpenCog AtomSpace architecture?

While discussing database backends, Dr. Linas Vepstas wrote:
Why a triplestore? Why not just a plain-old ordinary no/coSQL DB?  (e.g. Cassandra, MongoDB, CouchDB, Redis, whatever) Is there something a triplestore can do better?
Also: I will say this again an again: the current opencog system scales quite well to at least 5 nodes.  I suspect that, without much work, it could scale to orders of magnitude more.   I think we should learn how to walk before we try to run.
As far a I know, I am the only person who has ever even attempted to run opencog on more than one machine.  Its not hard ...
Anyway, I think there's another important task to consider before scalability: the ability to remember things, and to manage and control what is remembered. Yes, LTI and VLTI are supposed to do this automatically, but, again, as far as I know, no one besides myself has ever taught something to an OpenCog system, then checkpointed or hibernated or 'cryogenically froze' the system, and then revived it to do more work.   This is not hard in principle, but tricky in practice because it's easy to save too much or too little.
Yes, I agree, there's a certain set of issues that could potentially dog the current AtomSpace design. However, I don't think that any of us has sufficient experience with the system to say what they are, and so trying to design a brand-new system that slays some imagined dragons is a distraction: the real dragons are almost always somewhere else.
I agree with Dr. Vepstas here. I also agree with Dr. Vepstas' other point with regard to PostgreSQL.

I have experience with MongoDB, PostgreSQL, CouchDB, Jena TDB, Neo4j, and Cassandra so I'd like to share.

At one end of this spectrum (w.r.t. query mechanism) is PostgreSQL, with MongoDB in the middle, and CouchDB then Cassandra at the other end.

Jena TDB and Neo4j are different, as they're graph databases. Jena TDB is an RDF quadstore (with named graph support), so I believe Systap's is similar here.

My vote goes to Neo4j, but PostgreSQL is a good contender too, as I'll elaborate below.

I'd eliminate Cassandra and CouchDB. Cassandra's architecture requires a schema-first approach to querying, which means in practice you need to write the same data into multiple "tables" or column families at once, and deal with updates in complex ways.

CouchDB is somewhat similar, with the addition that it has views as a semi-convenient mechanism for structuring queries, but querying like what Dr. Goertzel described is still going to be painful.

Cassandra and CouchDB's strengths are horizontal scalability and multi-master replication, which could be useful in a distributed Mind-Agents architecture (but I think this is better treated as a separate concern).

The second group I'd like to eliminate *for this phase* are RDF triple/quadstores like Jena TDB and Systap. BTW, I'm a proponent of RDF, and I'm using Linked Data technologies in my master's thesis, the Lumen Robot Friend Knowledge Base.
The reason is that to use a triplestore you need to structure your data as RDF, and then you need to think in SPARQL. Oh, and indexes.

One main issue is reification, because RDF relations aren't n-ary. Standard reification isn't recommended anyway, and nobody uses it, so the best approach is singleton properties, as used in YAGO2s by Prof. Fabian Suchanek.

I think usage of an RDF store can be revisited later, given its benefits (ecosystem, Linked Data Platform, standard everything: Turtle, SPARQL, tools, etc.), but at this point you'll spend too much effort trying to (retro)fit your data model into the RDF world.

MongoDB is general-purpose enough and has good scalability, but I think storing OpenCog's AtomSpace structure won't use MongoDB's main benefit, which is its flexible document structure and nested subdocuments. The cons of MongoDB are that aggregation is not so intuitive and there is no support for joins.

I assume PostgreSQL needs no introduction, but I would like to remind you that it supports key-value columns (hstore), JSON, window functions, materialized views, and function-based indexes. I think Instagram has proven that (with proper data modeling) PostgreSQL can scale. And if that isn't enough, there's the horizontally scalable Postgres-XL.

Neo4j is in many ways similar to an RDF store, but instead of statements (triples), Neo4j uses nodes, relationships/links, and properties on both nodes and relationships, which makes RDF-style reification unnecessary in most situations. Neo4j also has a much more convenient query language than SPARQL, called Cypher:

# which movies do I rate, how many stars, what comment?
MATCH (me:User {name: "Me"}) -[r:RATED]-> (movie)
RETURN r.stars, r.comment, movie.title;

For Dr. Goertzel's example:

ATOMSPACE VERSION:

EvaluationLink <.9,.2>
   eat
   (cat, mouse)

I'd like to suggest an alternative triplestore version as follows: (pseudo-Turtle format)

TRIPLESTORE VERSION:

eat a PredicateNode,
      rdf:Property
cat a ConceptNode
mouse a ConceptNode

cat eat#1 mouse

eat#1 a EvaluationLink
eat#1 singletonPropertyOf eat
eat#1 hasStrength .9
eat#1 hasConfidence .2

I put "evaluationLink" as lowercase since I assumed it's treated as an RDF predicate instead of an RDF Class.
I use the singleton property method for reification of the evaluationLink relationship.
Note that I "cheated" here by mapping ListLink directly into subject-predicate-object triple.
Ordered lists are difficult to express conveniently in RDF triples, and this is generally the case in graph databases. Sets are much easier.
I'd recommend changing the EvaluationLink structure to "(EvaluationLink truthvalue predicate subject object)", which is easier to map to databases (and to OO classes), and removing ListLinks whenever possible.
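As a sketch of why that flattened "(EvaluationLink truthvalue predicate subject object)" shape maps so directly onto OO classes (the class names below are my own illustration, not OpenCog's):

```python
# Hypothetical OO mapping of the flattened EvaluationLink shape,
# with no intermediate ListLink.
from dataclasses import dataclass

@dataclass
class Node:
    type: str    # e.g. "ConceptNode" or "PredicateNode"
    name: str

@dataclass
class EvaluationLink:
    strength: float
    confidence: float
    predicate: Node
    subject: Node
    object: Node

eat = Node("PredicateNode", "eat")
cat = Node("ConceptNode", "cat")
mouse = Node("ConceptNode", "mouse")

# <.9,.2> eat(cat, mouse) becomes one flat record:
link = EvaluationLink(0.9, 0.2, eat, cat, mouse)
print(link.predicate.name, link.subject.name, link.object.name)  # → eat cat mouse
```

Each field of the class corresponds one-to-one to a database column or a graph relationship, which is exactly what makes the persistence mapping straightforward.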

In Neo4j it'd be structured as: (valid Cypher query)

CREATE
(eat:PredicateNode {name: "eat"}),
(cat:ConceptNode {name: "cat"}),
(mouse:ConceptNode {name: "mouse"}),
(eat1:EvaluationLink {name: "eat1", strength: 0.9, confidence: 0.2}),
(eat1)-[:PREDICATE]->(eat),
(eat1)-[:SUBJECT]->(cat),
(eat1)-[:OBJECT]->(mouse)

The really nice thing (at least for starters) about Neo4j is instant gratification. You can get the following visualization simply by doing the equivalent of "SELECT * FROM table". I believe even someone unfamiliar with OpenCog can grasp the essence of this graph:



You can actually play with our sample graph above online here:

Again, for this sample the ListLink is collapsed into distinct SUBJECT and OBJECT relationships, which is easier to model and more intuitive. It's possible to model a ListLink in Neo4j using relationship properties, but it doesn't look as pretty for this example. :) It is usable, though, for other cases where an actual ordered list is required.

For the above example, with PostgreSQL I'd create predicatenode and conceptnode tables, and find a way to structure evaluationlink taking into consideration the polymorphism and dynamic nature of the AtomSpace (assuming we want to use foreign keys & joins). It's not as straightforward as Neo4j, but PostgreSQL has features like hstore that make it flexible. Also, with full control of indexes, views, etc., it should be possible to tweak its performance for large datasets.
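To make the relational modeling idea concrete, here is a rough sketch using SQLite as a lightweight stand-in for PostgreSQL (the schema and table names are my own illustration, not OpenCog's; PostgreSQL-specific features like hstore are not shown):

```python
# Relational sketch of nodes + evaluationlink, with SQLite standing in
# for PostgreSQL. Schema is hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE node (
    id INTEGER PRIMARY KEY,
    type TEXT NOT NULL,      -- 'ConceptNode', 'PredicateNode', ...
    name TEXT NOT NULL
);
CREATE TABLE evaluationlink (
    id INTEGER PRIMARY KEY,
    strength REAL, confidence REAL,
    predicate_id INTEGER REFERENCES node(id),
    subject_id   INTEGER REFERENCES node(id),
    object_id    INTEGER REFERENCES node(id)
);
""")
db.executemany("INSERT INTO node (id, type, name) VALUES (?, ?, ?)",
               [(1, "PredicateNode", "eat"),
                (2, "ConceptNode", "cat"),
                (3, "ConceptNode", "mouse")])
db.execute("INSERT INTO evaluationlink VALUES (1, 0.9, 0.2, 1, 2, 3)")

# Join-based retrieval of the whole evaluation:
row = db.execute("""
    SELECT p.name, s.name, o.name, e.strength, e.confidence
    FROM evaluationlink e
    JOIN node p ON p.id = e.predicate_id
    JOIN node s ON s.id = e.subject_id
    JOIN node o ON o.id = e.object_id
""").fetchone()
print(row)  # → ('eat', 'cat', 'mouse', 0.9, 0.2)
```

Note the three joins needed to reassemble one evaluation; this is the extra work relative to Neo4j, where the same traversal is a one-line MATCH.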

Thursday, July 3, 2014

YAGO2s Knowledge Base for Testing Robot Knowledge

The YAGO2s semantic knowledge base will be used as fact data for the Lumen Knowledge Base.


To keep application development focused and its evaluation measurable, test data is needed.

For later FitNesse acceptance testing, some examples of the generated test data are shown below, as question-answer pairs in two languages, English and Indonesian. These will test capabilities in terms of language detection, natural language parsing, natural language generation, localization, and semantic queries for direct facts (no inference, no reasoning).

English | Bahasa Indonesia
What is the airport code of Freeman Municipal Airport?
  Airport code of Freeman Municipal Airport is 'SER'.
Apa kode bandara Freeman Municipal Airport?
  Kode bandara Freeman Municipal Airport adalah 'SER'.
What is Huayna Picchu's latitude?
  Huayna Picchu's latitude is -13.158°.
Berapa lintang Huayna Picchu?
  Lintang Huayna Picchu adalah -13,158°.
When was Vampire Lovers destroyed?
  Vampire Lovers was destroyed on year 1990.
Kapan Vampire Lovers dihancurkan?
  Vampire Lovers dihancurkan pada tahun 1990.
What is the gini index of Republica De Nicaragua?
  Gini index of Republica De Nicaragua is 52.3%.
Berapa indeks gini Republica De Nicaragua?
  Indeks gini Republica De Nicaragua adalah 52,3%.
What movies did Anand Milind write the music for?
  Anand Milind wrote music for Jeevan Ki Shatranj.
Anand Milind menciptakan lagu untuk film apa?
  Anand Milind menciptakan lagu untuk Jeevan Ki Shatranj.
How many people live in Denton, Montana?
  Denton, Montana's population is 301 people.
Berapa populasi Denton, Montana?
  Populasi Denton, Montana adalah 301 orang.
How much is the GDP of Беларусь?
  GDP of Беларусь is $55,483,000,000.00.
Berapa PDB Беларусь?
  PDB Беларусь adalah USD55.483.000.000,00.
Where is House of Flora's website?
  House of Flora's website is at http://houseofflora.bigcartel.com/products.
Di mana alamat website House of Flora?
  Alamat website House of Flora ada di http://houseofflora.bigcartel.com/products.
How much is the revenue of Scientific-Atlanta?
  Revenue of Scientific-Atlanta is $1,900,000,000.00.
Berapa pendapatan Scientific-Atlanta?
  Pendapatan Scientific-Atlanta adalah USD1.900.000.000,00.
Who are the children of Rodney S. Webb?
  Children of Rodney S. Webb are Todd Webb.
Siapa saja anak Rodney S. Webb?
  Anak Rodney S. Webb adalah Todd Webb.
What is the currency of Kyrgzstan?
  Currency of Kyrgzstan is Kyrgyzstani som.
Apa mata uang Kyrgzstan?
  Mata uang Kyrgzstan adalah Kyrgyzstani som.
Where did Siege of Candia happen?
  Siege of Candia happened in Ηράκλειο.
Di mana Siege of Candia terjadi?
  Siege of Candia terjadi di Ηράκλειο.
What is the citizenship of Amanda Mynhardt?
  Amanda Mynhardt is a citizen of Republic of South Africa.
Amanda Mynhardt warganegara mana?
  Amanda Mynhardt adalah warganegara Republic of South Africa.
When was Stelios born?
  Stelios was born on Tuesday, November 15, 1977.
Kapan Stelios dilahirkan?
  Stelios lahir pada Selasa 15 November 1977.
Where did Henry Hallett Dale die?
  Henry Hallett Dale died in Grantabridge.
Di mana Henry Hallett Dale meninggal dunia?
  Henry Hallett Dale meninggal dunia di Grantabridge.
Who did Diefenbaker marry?
  Diefenbaker is married to John Diefenbaker.
Siapa pasangan Diefenbaker?
  Diefenbaker menikahi John Diefenbaker.
How tall is Calpine Center?
  Calpine Center's height 138.074 m.
Berapa tinggi Calpine Center?
  Tinggi Calpine Center adalah 138,074 m.
What does Pearlette lead?
  Pearlette is a leader of St.lucia.
Pearlette memimpin apa?
  Pearlette adalah pemimpin St.lucia.
Where does Ty Tryon live?
  Ty Tryon lives in Orlando, Fla..
Ty Tryon tinggal di mana?
  Ty Tryon tinggal di Orlando, Fla..
What movies did Markowitz direct?
  Markowitz directed Murder in the Heartland.
Markowitz menyutradarai film apa?
  Markowitz menyutradarai Murder in the Heartland.
What did Thalía create?
  Thalía created I Want You/Me Pones Sexy.
Apa yang dibuat Thalía?
  Thalía membuat I Want You/Me Pones Sexy.
How much is Pōtītī's inflation?
  Pōtītī's inflation is 1.1 %.
Berapa inflasi Pōtītī?
  Inflasi Pōtītī adalah 1,1 %.
What is the capital city of Kingdom of Bavaria?
  Capital city of Kingdom of Bavaria is Minga.
Apa ibu kota Kingdom of Bavaria?
  Ibu kota Kingdom of Bavaria adalah Minga.
How much does Mária Mohácsik weight?
  Mária Mohácsik weights 70,000 g.
Berapa berat Mária Mohácsik?
  Berat Mária Mohácsik adalah 70.000 g.
What is the language code of Gujarati (India)?
  The language code of Gujarati (India) is 'gu'.
Apa kode bahasa dari Gujarati (India)?
  Kode bahasa dari Gujarati (India) adalah 'gu'.
What movies star Raaj Kumar?
  Raaj Kumar acted in Pakeezah.
Film apa saja yang dibintangi Raaj Kumar?
  Raaj Kumar membintangi Pakeezah.
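The pairs above exercise, among other things, number localization (e.g. -13.158° in English vs -13,158° in Indonesian, and $55,483,000,000.00 vs USD55.483.000.000,00). A minimal sketch of that formatting step (the helper below is hypothetical, not part of FitNesse or Lumen):

```python
# Hypothetical localization helper: English uses 1,234.5 while
# Indonesian uses 1.234,5, so we swap the two separators.

def localize_number(value, lang):
    text = f"{value:,}"          # English-style grouping, e.g. 55,483,000,000.0
    if lang == "id":
        # swap ',' and '.' via a temporary placeholder
        text = text.replace(",", "#").replace(".", ",").replace("#", ".")
    return text

print(localize_number(-13.158, "en"))          # → -13.158
print(localize_number(-13.158, "id"))          # → -13,158
print(localize_number(55483000000.0, "id"))    # → 55.483.000.000,0
```

A real implementation would use proper locale data (e.g. CLDR-based formatting) rather than separator swapping, but the test pairs above only require that the two conventions come out consistently.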

Once the test data is ready, the next step is of course to get the application to pass all of the tests above. :-) Amen.

Integration with YAGO2s as Semantic Knowledge Base

I recently had a conversation with Professor Fabian M. Suchanek, Associate Professor at Télécom ParisTech in Paris and creator of the YAGO2s semantic knowledge base. I'm truly thankful that Professor Suchanek made his hard and persistent work available to the public as open structured data.


I'm grateful he helped me explore concerns about my thesis. I need to put it down here for my own reference as well, and hopefully it's useful to you. :-)

I'm a master's student at Bandung Institute of Technology, Indonesia, and my thesis is a knowledge base for Lumen Robot Friend. Lumen is originally based on NAO, but for the purposes of my thesis I expect it to be more server-side, with clients for NAO, Android, and Windows. The server is also connected to Facebook, Twitter, and Google Hangouts for text/chat interaction.

The primary means of interaction will be natural language (text and speech for English, text only for Indonesian) via devices (such as NAO) and social media integrations.

Inputs will be parsed into semantic atoms, which will then be asserted into memory (or a persistence database) for reasoning, which in turn will be used as parameters for a fitness-scoring algorithm in order to select appropriate responses/actions (if so desired).

YAGO2s will be used as the knowledge base, so Lumen will know the answers to straightforward questions like "where is Paris?" and hopefully somewhat more complex questions (not in terms of database query, but in terms of mapping from natural language to a semantic query and back to natural language generation) like "which cheeses are made in France?" (BTW, I *love* cheese! Thank you, cheese inventors :) )

Other than hard facts, Lumen can learn simple social intelligence (I'm targeting a prekindergarten-like semantic and reasoning ability) via daily interaction and a specialized training UI. Things like who its friends are, what food its friends like, etc.

I hope not to be too ambitious with my thesis, as I'm exploring boundaries to assess which objectives would be attainable (and which to leave out of scope) during my thesis (the next 18 months). :) Although it sounds like it's going to deal with tens of GBs of data (YAGO2s is already 22 GB before indexing... and I still have WordNet, etc. copies lying all around my hard drive, hehe..), the scope of my thesis will be much more limited than that.

My plan would be to devise a set of rule languages (DSLs) which, for one, would allow pattern matching of subhypergraphs (as OpenCog's AtomSpace puts it) to a YAGO property. It's just like a switch statement, but for graphs instead of literals. In practice the current prototype works "alright" with just subtrees, so I hope I won't actually have to match subgraphs :)

So "where is Paris?" / "Paris ada di mana?" would be turned by NLP parser into semantic representation like:

(QuestionLink where Paris)

which we can write a rule e.g.

when
  (QuestionLink where $place)
then
  select o as $loc
    where s = $place and p = 'isLocatedIn'

  assert (EvaluationLink (isLocatedIn $place $loc))

which will assert statement:

(EvaluationLink
  (isLocatedIn Paris
    [ "Europe", "Île-de-France_(region)" ]))


which the NLP generation will produce:

Paris is located in Europe and Île de France (region).

...which is still correct, although a human would probably answer it in a different way. :)
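To make the rule idea concrete, here is a toy Python sketch (the data and helper names are mine, not the actual DSL): it matches the semantic expression against a pattern containing a $place variable, queries a small fact table standing in for the YAGO2s store, and generates the English sentence.

```python
# Toy rule engine: pattern match with variable binding, fact lookup,
# and naive sentence generation. All names are illustrative.

FACTS = {("Paris", "isLocatedIn"): ["Europe", "Île-de-France_(region)"]}

def match(pattern, expr):
    """Bind $variables in a flat pattern like ('QuestionLink','where','$place')."""
    if len(pattern) != len(expr):
        return None
    bindings = {}
    for p, e in zip(pattern, expr):
        if p.startswith("$"):
            bindings[p] = e      # bind the variable to the concrete value
        elif p != e:
            return None          # literal mismatch: rule does not fire
    return bindings

expr = ("QuestionLink", "where", "Paris")
b = match(("QuestionLink", "where", "$place"), expr)
if b is not None:
    place = b["$place"]
    locations = FACTS[(place, "isLocatedIn")]
    names = [loc.replace("_", " ") for loc in locations]
    print(f"{place} is located in {' and '.join(names)}.")
```

Real subtree (let alone subgraph) matching is recursive rather than flat, but the rule's shape — when-pattern, KB query, assert/generate — is the same.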

The goal is to develop this DSL so that it's possible to express mappings from semantic queries to YAGO KB queries to natural language answers. I'll write only a few rules, as proof and example, not aiming to be expansive. More rules can then be written to increase coverage, which is outside the scope of the thesis.

I also plan to limit the scope of NLP processing and word sense disambiguation (WSD): I'll just use link-grammar and OpenCog's RelEx, so parseable sentences are also scoped by these libraries' capabilities.

I also need to state my scope better as to not inflate the expectations of my thesis advisor, Dr.techn. Ary Setijadi Prihatmanto. :)

Wednesday, July 2, 2014

Simulating Sympathy in Robots with MOSES Machine Learning

Sentences in question-answer format carry clear expectations about the outcome of an interaction with the robot. But what about conversational stories?

For example, when the robot finds a friend posting the status:
"Aku baru putus..." ("I just broke up...")
then the robot could respond with "aww... aku ikut sedih" ("aww... I'm sad for you") or something similar.


The situation is not just private conversation (chatbot), but also communication in groups and on social media walls, which may or may not trigger a response.

The required steps:
  1. Parse the received input/news (from both private and public channels) into a semantic structure
  2. Assert that semantic structure into long-term/short-term memory, along with annotation metadata (extractedFrom, timestamp)
  3. Perform predictive inference (presumption?) from those semantics (if Titi just broke up, then Titi is sad)
  4. Find the most sensible action plan based on the currently known information (short-term and long-term memory)
  5. Execute the action plan (possibly several, sequentially)
  6. For actions that require natural language generation, transform the semantic action into a sentence in the target language.

Assuming the modules for parsing syntax into semantics and back again already work well (in reality they don't yet), the key stage is step 4: finding the action plan.

MOSES, as a procedural machine learning framework, might be usable here. With the following input in short-term memory:
  1. Titi just broke up
  2. Titi is sad (inferred prediction)
  3. Fact #1 has not been responded to yet

We can then run MOSES to generate a "program", or solution, with the deme scope set by the parameters above. The "program" generated here is a semantic sentence. The score, or fitness, of a program is based on how well its statement matches facts #1 and #2.

Suppose the generated results and their fitness scores are:
  1. Aku ikut senang ("I'm happy for you") = -0.5
  2. Aku ikut sedih ("I'm sad for you") = +0.5

Then the robot will choose response #2 to perform.
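A toy sketch of this selection step (the scoring rule here is a hand-written stand-in for what MOSES, or a learned scorer, would actually provide):

```python
# Hypothetical fitness-based response selection. The fitness function
# is a hand-coded placeholder for a MOSES-evolved or learned scorer.

memory = [("Titi", "just_broke_up"), ("Titi", "is_sad")]

def fitness(response, memory):
    # Sympathetic responses fit a sad situation; cheerful ones do not.
    sad = any(fact == "is_sad" for _, fact in memory)
    if "sedih" in response:   # "I'm sad for you"
        return 0.5 if sad else -0.5
    if "senang" in response:  # "I'm happy for you"
        return -0.5 if sad else 0.5
    return 0.0

candidates = ["Aku ikut senang", "Aku ikut sedih"]
best = max(candidates, key=lambda r: fitness(r, memory))
print(best)  # → Aku ikut sedih
```

Swapping the hand-coded `fitness` for a learned one is exactly the homework described below: the selection machinery stays the same.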

The homework is:
  1. Building a generator/permutator of responses based on the deme
  2. Defining the fitness-scoring algorithm given a response and the state of memory

The fitness-scoring algorithm itself can use machine learning, so the robot learns on its own to recognize which responses are good and which are inappropriate for a given situation.

Is MOSES suitable for this? Or would it be better to use PLN (Probabilistic Logic Networks), another approach, or even a simple hack? We'll see. :-)

Tuesday, July 1, 2014

Using BabelNet 1.1.1 Multilingual Dictionary & Word Sense Disambiguation (Tutorial)

BabelNet v2.5, as of July 1st, 2014, has not yet provided downloadable path indexes, so I'm using BabelNet v1.1.1 for this tutorial.

How to Install BabelNet version 1.1.1


  1. Extract BabelNet API 1.1.1 as ~/babelnet-api-1.1.1
  2. Extract babelnet core lucene 1.1.1 to ~/babelnet-1.1.1
  3. Edit ~/babelnet-api-1.1.1/config/babelnet.var.properties :
    babelnet.dir=/home/ceefour/babelnet-1.1.1
  4. BabelNet demo requires WordNet 3.0 in /usr/local/share/wordnet-3.0/dict (by default).
    Download WordNet-3.0.tar.bz2 (8.6 MB) from http://wordnet.princeton.edu/wordnet/download/current-version/. Extract it to your home directory so it will create ~/WordNet-3.0 directory.
  5. Edit ~/babelnet-api-1.1.1/config/jlt.var.properties:
    jlt.wordnetPrefix=/home/ceefour/WordNet
  6. Using shell, go to ~/babelnet-api-1.1.1 and run:
    ./run-babelnetdemo.sh

Example (the output is very long, so this is not the complete output):

ceefour@amanah:~/babelnet-api-1.1.1 > ./run-babelnetdemo.sh 
[ INFO  ] BabelNetConfiguration - Loading babelnet.properties FROM /home/ceefour/babelnet-api-1.1.1/config/babelnet.properties
[ INFO  ] BabelNet - OPENING BABEL LEXICON FROM: /home/ceefour/babelnet-1.1.1/lexicon
[ INFO  ] BabelNet - OPENING BABEL DICTIONARY FROM: /home/ceefour/babelnet-1.1.1/dict
[ INFO  ] BabelNet - OPENING BABEL GLOSSES FROM: /home/ceefour/babelnet-1.1.1/gloss
[ INFO  ] BabelNet - OPENING BABEL GRAPH FROM: /home/ceefour/babelnet-1.1.1/graph
SYNSETS WITH English word: "bank"
[ INFO  ] Configuration - Loading jlt.properties FROM /home/ceefour/babelnet-api-1.1.1/config/jlt.properties
  =>(bn:00008363n) SOURCE: WIKIWN; TYPE: Concept; WN SYNSET: [09213565n];
  MAIN LEMMA: bank#n#1;
  IMAGES: [<a href="http://upload.wikimedia.org/wikipedia/commons/8/8c/Kuekenhoff_Canal_002.jpg">Kuekenhoff_Canal_002.jpg</a>, <a href="http://upload.wikimedia.org/wikipedia/commons/9/93/Namoi-River-sand-bank.jpg">Namoi-River-sand-bank.jpg</a>, <a href="http://upload.wikimedia.org/wikipedia/commons/1/10/Skawa_River,_Poland,_flood_2001.jpg">Skawa_River,_Poland,_flood_2001.jpg</a>, <a href="http://upload.wikimedia.org/wikipedia/commons/c/c6/RanelvaSelfors08.JPG">RanelvaSelfors08.JPG</a>, <a href="http://upload.wikimedia.org/wikipedia/commons/d/d6/Albertville_Voie_sur_berge.JPG">Albertville_Voie_sur_berge.JPG</a>, <a href="http://upload.wikimedia.org/wikipedia/commons/6/6b/Wheeling_Creek_Ohio.jpg">Wheeling_Creek_Ohio.jpg</a>, <a href="http://upload.wikimedia.org/wikipedia/commons/1/12/Regge_river_P3260276.JPG">Regge_river_P3260276.JPG</a>, <a href="http://upload.wikimedia.org/wikipedia/commons/5/55/2Kanal_bei_Tritolwerk.jpg">2Kanal_bei_Tritolwerk.jpg</a>, <a href="http://upload.wikimedia.org/wikipedia/commons/8/89/Shirakara_Canal,_Gion,_Kyoto.jpg">Shirakara_Canal,_Gion,_Kyoto.jpg</a>, <a href="http://upload.wikimedia.org/wikipedia/commons/2/24/Shukugawa03s3200.jpg">Shukugawa03s3200.jpg</a>, <a href="http://upload.wikimedia.org/wikipedia/commons/a/af/Damaged_Park_Road_at_Carbon.jpg">Damaged_Park_Road_at_Carbon.jpg</a>];
  CATEGORIES: [BNCAT:EN:Hydrology, BNCAT:EN:Geomorphology, BNCAT:EN:Limnology, BNCAT:EN:Freshwater_ecology, BNCAT:EN:Fluvial_landforms, BNCAT:EN:Riparian, BNCAT:EN:Rivers, BNCAT:EN:Water_streams, BNCAT:EN:Water_and_the_environment, BNCAT:FR:Cours_d'eau];
  SENSES (German): { WIKITR:DE:bank_0.55000_11_20 WIKITR:DE:streamside_1.00000_1_1 WIKITR:DE:flussufern_0.40000_2_5 WIKITR:DE:ufer_0.40000_2_5 WIKITR:DE:strom_bank_0.42857_3_7 WIKITR:DE:streambanks_1.00000_1_1 WIKITR:DE:ufer_0.42857_9_21 WNTR:DE:bank_0.53846_7_13 }
  -----
    EDGE gdis bn:00110761a { WN:EN:sloping }
    EDGE gdis bn:00046303n { WN:EN:slope, WN:EN:incline, WN:EN:side }
    EDGE gdis bn:00011766n { WIKIRED:EN:Water(molecule), WIKIRED:EN:Hydrogen_oxide, WIKIRED:EN:Water_(liquid), WIKIRED:EN:H₂O, WIKIRED:EN:Hydroxilic_acid, WIKIRED:EN:Oxygen_dihydride, WIKIRED:EN:Water_body, WIKIRED:EN:Chemical_water, WIKIRED:EN:Μ-oxido_dihydrogen, WIKIRED:EN:Hydric_oxide, WIKIRED:EN:Water_(molecule), WIKIRED:EN:Dihydrogenoxide, WIKIRED:EN:Diprotium_oxide, WIKIRED:EN:Hydroxylic_acid, WIKIRED:EN:OH2, WIKIRED:EN:Bodies_of_water, WIKIRED:EN:Density_of_water, WIKI:EN:Properties_of_water, WIKIRED:EN:Hydrogen_Hydroxide, WIKIRED:EN:Hydroxic_acid, WIKIRED:EN:Dihydrogen_oxide, WIKI:EN:Body_of_water, WIKIRED:EN:Waterbodies, WIKIRED:EN:Hydrogen_hydroxide, WIKIRED:EN:Water_(properties), WIKIRED:EN:Water_(Molecule), WIKIRED:EN:Unique_properties_of_water, WIKIRED:EN:Μ-oxido_hydrogen, WIKIRED:EN:H1.5O, WIKIRED:EN:Hydroxyl_monohydride, WIKIRED:EN:Hydrohydroxic_acid, WIKIRED:EN:Hydrogen_monoxide, WIKIRED:EN:Waterbody, WIKIRED:EN:Water_molecule, WIKIRED:EN:Water_bodies, WN:EN:body_of_water, WN:EN:water }
...
    EDGE r bn:01001902n { WIKIRED:EN:Spa_(Belgium), WIKI:EN:Spa,_Belgium }
  -----
  =>(bn:03802146n) SOURCE: WIKI; TYPE: Concept; WN SYNSET: [];
  MAIN LEMMA: WIKI:EN:Ocean_bank_(topography);
  IMAGES: [];
  CATEGORIES: [BNCAT:EN:Physical_oceanography, BNCAT:EN:Fishing_banks, BNCAT:EN:Undersea_banks];
  SENSES (German): { WIKITR:DE:ozean_bank_1.00000_1_1 WIKITR:DE:bank_0.50000_1_2 WIKITR:DE:ufer_0.50000_1_2 WIKITR:DE:bank_0.94444_17_18 WIKITR:DE:fischerei_bank_0.25000_5_20 WIKITR:DE:fishing_bank_0.25000_5_20 }
  -----
    EDGE r bn:03180024n { WIKIRED:EN:Carbonate_mound, WIKIRED:EN:Platform_carbonate, WIKI:EN:Carbonate_platform, WIKIRED:EN:Carbonate_platforms }
    EDGE r bn:00175840n { WIKIRED:EN:Coastal_upwelling, WIKI:EN:Upwelling }
    EDGE r bn:00009026n { WIKI:EN:Continental_margin, WIKI:EN:Bathyal_zone, WIKIRED:EN:Continental-margin, WIKIRED:EN:Continental_slope, WIKIRED:EN:Bathypelagic, WIKIRED:EN:Midnight_Zone, WIKIRED:EN:Bathyal_Zone, WIKIRED:EN:Bathyal, WIKIRED:EN:Passive_continental_margin, WIKIRED:EN:Active_continental_margin, WN:EN:continental_slope, WN:EN:bathyal_zone, WN:EN:bathyal_district }
    EDGE r bn:00047612n { WIKIRED:EN:Volcanic_isles, WIKIRED:EN:Islands, WIKIRED:EN:Volcanic_islands, WIKIRED:EN:IslandS, WIKIRED:EN:Ocean_islands, WIKIRED:EN:Former_island, WIKI:EN:Island, WIKIRED:EN:Eilean, WIKIRED:EN:Pulau, WN:EN:island }
    EDGE r bn:02811607n { WIKIRED:EN:Grand_Banks, WIKI:EN:Grand_Banks_of_Newfoundland, WIKIRED:EN:Great_Banks }
    EDGE r bn:00025408n { WIKIRED:EN:Seafloor, WIKIRED:EN:Seafloor_exploration, WIKIRED:EN:Sea_floor, WIKIRED:EN:Marine_floor, WIKIRED:EN:Underwater_seafloor_exploration, WIKI:EN:Seabed, WIKI:EN:Davy_Jones_(racing_driver), WIKIRED:EN:Ocean_floor, WIKIRED:EN:Davy_Jones_(driver), WIKIRED:EN:Sea_bed, WN:EN:ocean_floor, WN:EN:sea_floor, WN:EN:ocean_bottom, WN:EN:seabed, WN:EN:sea_bottom, WN:EN:Davy_Jones's_locker, WN:EN:Davy_Jones }
    EDGE r bn:03225190n { WIKI:EN:Oceanic_plateau, WIKIRED:EN:Submarine_Plateau, WIKIRED:EN:Oceanic_Plateau, WIKIRED:EN:Submarine_plateau }
    EDGE r bn:00383615n { WIKIRED:EN:Peñasco_Quebrado, WIKIRED:EN:Middle_Farallon_Island, WIKIRED:EN:Maintop_Island, WIKIRED:EN:Farallon_Island_Nuclear_Waste_Dump, WIKIRED:EN:Drunk_Uncle_Islets, WIKIRED:EN:Sugarloaf_Island, WIKIRED:EN:Aulone_Island, WIKIRED:EN:Farallon_Island, WIKIRED:EN:Great_Arch_Rock, WIKIRED:EN:Farallones, WIKIRED:EN:Seal_Rock,_Farallon_Islands, WIKIRED:EN:Piedra_Guadalupe, WIKIRED:EN:Farallón_Islands, WIKIRED:EN:Farallon_Islands_National_Wildlife_Refuge, WIKIRED:EN:Farallon_National_Wildlife_Refuge, WIKIRED:EN:Farallon_Wilderness, WIKI:EN:Farallon_Islands, WIKIRED:EN:North_Farallon_Island, WIKIRED:EN:Seal_Rock_(Farallon_Islands), WIKIRED:EN:Island_of_St._James, WIKIRED:EN:Farallone_Islands, WIKIRED:EN:Farallón_Viscaíno, WIKIRED:EN:Southeast_Farallon_Island }
    EDGE r bn:00069946n { WIKI:EN:Sea, WIKIRED:EN:Worlds_seas, WIKIRED:EN:ทะเล, WN:EN:sea }
    EDGE r bn:00049842n { WIKI:EN:Landmass, WIKI:EN:Land_mass, WN:EN:landmass, WN:EN:land_mass }
    EDGE r bn:00070032n { WIKI:EN:Seamount, WIKIRED:EN:Sea_mount, WIKIRED:EN:Seamounts, WN:EN:seamount }
    EDGE r bn:03433800n { WIKIRED:EN:Sedimented, WIKIRED:EN:Sedimentary_soil, WIKIRED:EN:Sedements, WIKIRED:EN:Detrital_sediment, WIKIRED:EN:Sea_Sediment, WIKI:EN:Sediment, WIKIRED:EN:Bomb_sag, WIKIRED:EN:Sediments, WIKIRED:EN:Sedimentary_layer }
    EDGE r bn:01303244n { WIKI:EN:Wachusett_Reef, WIKIRED:EN:Wachusett_Bank }
    EDGE r bn:00077192n { WIKIRED:EN:Tidal_flow, WIKIRED:EN:Tidal_current, WN:EN:tidal_flow, WN:EN:tidal_current }
    EDGE r bn:00071161n { WIKIRED:EN:Sandbank, WIKIRED:EN:Sand_bank, WIKIRED:EN:Shoals, WIKIRED:EN:Longshore_bar, WIKIRED:EN:Offshore_bar, WIKIRED:EN:Barrier_beach, WIKIRED:EN:Barrier_bar, WIKIRED:EN:Sandbars, WIKI:EN:Shoal, WIKIRED:EN:Bar_(landform), WIKIRED:EN:Sand_banks, WN:EN:shoal }
    EDGE r bn:00735063n { WIKIRED:EN:Dogger_bank, WIKIRED:EN:Dogger_Hills, WIKI:EN:Dogger_Bank, WIKIRED:EN:Doggerbank, WIKIRED:EN:Doggersbank }
    EDGE r bn:00080211n { WIKIRED:EN:Dormant_volcanoes, WIKIRED:EN:Volcanos, WIKIRED:EN:Extinct_volcanoes, WIKIRED:EN:How_volcanoes_are_formed, WIKIRED:EN:Volcano_eruption, WIKIRED:EN:Valcano, WIKIRED:EN:Volcanoe_facts, WIKIRED:EN:Volcanicity, WIKIRED:EN:Active_Volcano, WIKIRED:EN:Volcanic_vent, WIKIRED:EN:Volcanic_activity, WIKIRED:EN:Volcano_(geological_landform), WIKIRED:EN:Erupt, WIKIRED:EN:🌋, WIKIRED:EN:Extinct_Volcano, WIKIRED:EN:Volcanic_mountains, WIKIRED:EN:Volcanoes, WIKIRED:EN:Volcanoe, WIKIRED:EN:Volcanic_mountain, WIKIRED:EN:All_about_Volcanos, WIKI:EN:Volcano, WIKIRED:EN:Volcanic_aerosols, WIKIRED:EN:Volcanic, WIKIRED:EN:Last_eruption, WIKIRED:EN:Crater_Row, WIKIRED:EN:Valcanos, WN:EN:volcano }
    EDGE r bn:03087586n { WIKIRED:EN:Deep-sea, WIKIRED:EN:Ocean_depths, WIKIRED:EN:Deep_ocean, WIKIRED:EN:Deep_layer, WIKI:EN:Deep_sea }
    EDGE r bn:00006813n { WIKIRED:EN:Faru, WIKIRED:EN:Darwin_point, WIKI:EN:Atoll, WIKIRED:EN:Coral_atoll, WIKIRED:EN:Atolls, WIKIRED:EN:Atoll_reef, WN:EN:atoll }
  -----

About WordNet 3.0 Ubuntu package

The WordNet 3.0 Ubuntu package is not usable by BabelNet, but it is educational and informative.
To install the WordNet 3.0 Ubuntu package (6.5 MB):

sudo aptitude install wordnet

WordNet 3.0 will be installed at /usr/share/wordnet.
There's also a wordnet executable that you can use, e.g. "wordnet bird -over":

ceefour@amanah:~ > wordnet bird -over

Overview of noun bird

The noun bird has 5 senses (first 2 from tagged texts)
                                           
1. (29) bird -- (warm-blooded egg-laying vertebrates characterized by feathers and forelimbs modified as wings)
2. (1) bird, fowl -- (the flesh of a bird or fowl (wild or domestic) used as food)
3. dame, doll, wench, skirt, chick, bird -- (informal terms for a (young) woman)
4. boo, hoot, Bronx cheer, hiss, raspberry, razzing, razz, snort, bird -- (a cry or noise made to express displeasure or contempt)
5. shuttlecock, bird, birdie, shuttle -- (badminton equipment consisting of a ball of cork or rubber with a crown of feathers)

Overview of verb bird

The verb bird has 1 sense (no senses from tagged texts)
                                          
1. bird, birdwatch -- (watch and study birds in their natural habitat)