Monday, 07 July 2014

Neo4j as Graph Database for OpenCog AtomSpace architecture?

Why a triplestore? Why not just a plain-old ordinary no/coSQL DB? (e.g. Cassandra, MongoDB, CouchDB, Redis, whatever) Is there something a triplestore can do better?
Also: I will say this again and again: the current opencog system scales quite well to at least 5 nodes. I suspect that, without much work, it could scale to orders of magnitude more. I think we should learn how to walk before we try to run.
As far as I know, I am the only person who has ever even attempted to run opencog on more than one machine. It's not hard ...
Anyway, I think there's another important task to consider, before scalability: the ability to remember things, and to manage and control what is remembered. Yes, LTI and VLTI are supposed to do this automatically, but, again, as far as I know, no one besides myself has ever taught something to an opencog system, then checkpointed or hibernated or 'cryogenically froze' the system, and then revived it to do more work. This is not hard in principle, but tricky in practice because it's easy to save too much or too little.
Yes, I agree, there's a certain set of issues that could potentially dog the current atomspace design. However, I don't think that any of us has sufficient experience with the system to say what they are, and so trying to design a brand new system that slays some imagined dragons is a distraction: the real dragons are almost always somewhere else.
I agree with Dr. Vepstas here. I also agree with Dr. Vepstas' other point with regard to PostgreSQL.

I have experience with MongoDB, PostgreSQL, CouchDB, Jena TDB, Neo4j, and Cassandra, so I'd like to share it.

At one end of this spectrum (w.r.t. query mechanism) is PostgreSQL, with MongoDB in the middle, and CouchDB and then Cassandra at the other end.

Jena TDB and Neo4j are different, as they're graph databases. Jena TDB is an RDF quadstore (with named graph support), so I believe Systap's is similar here.

My vote goes to Neo4j, but PostgreSQL is a good contender too, as I'll elaborate below.

I'd eliminate Cassandra and CouchDB. Cassandra's architecture requires a schema-first approach to querying: you design the tables around the queries you intend to run. In practice this means you need to write the same data into multiple "tables" or column families at once, and deal with updates in complex ways.
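
To make the duplication concrete, here is a minimal Python sketch that uses plain dictionaries as stand-ins for two column families. The table names (links_by_subject, links_by_predicate) and the helper functions are hypothetical, and no Cassandra driver is involved. The point is only that each query pattern needs its own copy of the data, and that every update has to touch all copies.

```python
from collections import defaultdict

# Each dict stands in for one column family, keyed by the field we query on.
# (Hypothetical names; this is an illustration, not a Cassandra schema.)
links_by_subject = defaultdict(dict)    # subject   -> {link_id: row}
links_by_predicate = defaultdict(dict)  # predicate -> {link_id: row}


def insert_link(link_id, predicate, subject, obj, strength):
    """Write the same row into every 'table' that serves a query pattern."""
    row = {"predicate": predicate, "subject": subject,
           "object": obj, "strength": strength}
    links_by_subject[subject][link_id] = dict(row)
    links_by_predicate[predicate][link_id] = dict(row)


def update_strength(link_id, predicate, subject, strength):
    """An update must be applied to every copy, or the copies drift apart."""
    links_by_subject[subject][link_id]["strength"] = strength
    links_by_predicate[predicate][link_id]["strength"] = strength


insert_link("eat1", "eat", "cat", "mouse", 0.9)
update_strength("eat1", "eat", "cat", 0.8)
print(links_by_subject["cat"]["eat1"]["strength"])
print(links_by_predicate["eat"]["eat1"]["strength"])
```

Forgetting one of the two writes in update_strength would leave the copies inconsistent, which is the kind of bookkeeping the application has to take on itself.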

CouchDB is somewhat similar, with the addition that it has views as a semi-convenient mechanism to structure queries, but querying like what Dr. Goertzel described is still going to be painful.

Cassandra's and CouchDB's strengths are horizontal scalability and multi-master replication, which could be useful in a distributed Mind-Agents architecture (but I think this would better be a separate concern).

The second group I'd like to eliminate *for this phase* is RDF triple/quadstores like Jena TDB and Systap. BTW I'm a proponent of RDF and I'm using Linked Data technologies in my master's thesis, Lumen Robot Friend Knowledge Base.
The reason is that to use a triplestore you need to structure your data like RDF and then you need to think in SPARQL. Oh, and indexes.

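One main issue is reification: an RDF statement is a binary subject-predicate-object relation, not an n-ary one, so attaching extra data (such as a truth value) to a statement requires describing the statement itself. Standard RDF reification isn't recommended and hardly anybody uses it, so the best approach is singleton properties, like what YAGO2s by Prof. Fabian Suchanek does.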
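
As a rough illustration of the difference, the following Python sketch builds both encodings as plain (subject, predicate, object) tuples. The rdf:Statement, rdf:subject, rdf:predicate and rdf:object terms are the standard RDF reification vocabulary. The function names, the stmt1 identifier and the hasStrength/hasConfidence property names are assumptions made for this example, chosen to match the cat/mouse example used in this post.

```python
# Triples are plain (subject, predicate, object) tuples.

def standard_reification(stmt_id, s, p, o, metadata):
    """RDF-style reification: describe the statement as a resource."""
    triples = [
        (stmt_id, "rdf:type", "rdf:Statement"),
        (stmt_id, "rdf:subject", s),
        (stmt_id, "rdf:predicate", p),
        (stmt_id, "rdf:object", o),
    ]
    triples += [(stmt_id, key, value) for key, value in metadata.items()]
    return triples


def singleton_property(s, p, o, metadata, n=1):
    """Singleton property: mint a one-off predicate and annotate that."""
    sp = f"{p}#{n}"
    triples = [
        (s, sp, o),                      # the statement itself stays a triple
        (sp, "singletonPropertyOf", p),  # link back to the generic predicate
    ]
    triples += [(sp, key, value) for key, value in metadata.items()]
    return triples


meta = {"hasStrength": 0.9, "hasConfidence": 0.2}
reified = standard_reification("stmt1", "cat", "eat", "mouse", meta)
singleton = singleton_property("cat", "eat", "mouse", meta)
print(len(reified), len(singleton))
```

With singleton properties the original statement is still directly queryable as a single triple, whereas with standard reification it has to be reassembled from the rdf:subject, rdf:predicate and rdf:object triples.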

I think the use of an RDF store can be revisited later on, given its benefits (ecosystem, Linked Data Platform, standard everything like Turtle, SPARQL, tools, etc.), but at this point you'd spend too much effort trying to (retro)fit your data model into the RDF world.

MongoDB is general-purpose enough and has good scalability, but I think storing OpenCog's AtomSpace structure won't use the main benefit of MongoDB, which is its flexible document structure with nested subdocuments. The cons of MongoDB are that aggregation is not so intuitive and that there is no support for joins.

I assume PostgreSQL needs no introduction, but I'd like to point out that it supports key-value columns, JSON, window functions, materialized views, and function-based indexes. I think Instagram has proven that (with proper data modeling) PostgreSQL can scale. And if that isn't enough for scaling, there's the horizontally scalable Postgres-XL.

Neo4j is in many ways similar to an RDF store, but instead of statements (triples), Neo4j uses nodes, relationships/links, and properties inside both nodes and relationships, which makes RDF reification unnecessary in most situations. Neo4j also has a much more convenient query language (compared to SPARQL) called Cypher:

// which movies do I rate, how many stars, what comment?
MATCH (me:User {name: "Me"}) -[r:RATED]-> (movie)
RETURN r.stars, r.comment, movie.title;

For Dr. Goertzel's example:


EvaluationLink <.9,.2>
   (cat, mouse)

I'd like to suggest an alternative triplestore version as follows: (pseudo-Turtle format)


eat a PredicateNode
cat a ConceptNode
mouse a ConceptNode

cat eat#1 mouse

eat#1 a EvaluationLink
eat#1 singletonPropertyOf eat
eat#1 hasStrength .9
eat#1 hasConfidence .2

I put "evaluationLink" as lowercase since I assumed it's treated as an RDF predicate instead of an RDF Class.
I use the singleton property method for reification of the evaluationLink relationship.
Note that I "cheated" here by mapping ListLink directly into a subject-predicate-object triple.
Ordered lists are difficult to express conveniently in RDF triples, and this is generally the case in graph databases. Sets are much easier.
I'd recommend changing the EvaluationLink structure to "(EvaluationLink truthvalue predicate subject object)", which is easier to map to databases (and to OO classes), and removing ListLinks whenever possible.
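
To show what I mean by "easier to map to OO classes", here is a minimal Python sketch of the flattened structure. The class and field names (TruthValue, truth_value, obj, as_row) are my own assumptions for illustration, not OpenCog's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TruthValue:
    strength: float
    confidence: float


@dataclass(frozen=True)
class EvaluationLink:
    # Flattened form: (EvaluationLink truthvalue predicate subject object),
    # with no intermediate ListLink wrapping the subject and object.
    truth_value: TruthValue
    predicate: str
    subject: str
    obj: str

    def as_row(self):
        """Flatten to a tuple that maps directly onto one table row."""
        return (self.predicate, self.subject, self.obj,
                self.truth_value.strength, self.truth_value.confidence)


link = EvaluationLink(TruthValue(0.9, 0.2), "eat", "cat", "mouse")
print(link.as_row())
```

Because every field has a fixed position and name, the same shape carries over to a relational row, a document, or a node with three outgoing relationships.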

In Neo4j it'd be structured as: (valid Cypher query)

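CREATE
  (eat:PredicateNode {name: "eat"}),
  (cat:ConceptNode {name: "cat"}),
  (mouse:ConceptNode {name: "mouse"}),
  (eat1:EvaluationLink {name: "eat1", strength: 0.9, confidence: 0.2}),
  (eat1)-[:PREDICATE]->(eat),  // PREDICATE relationship name is assumed
  (eat1)-[:SUBJECT]->(cat),
  (eat1)-[:OBJECT]->(mouse)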

The really nice thing (at least for starters) about Neo4j is instant gratification. You can get the following visualization simply by doing the equivalent of "SELECT * FROM table". I believe even someone unfamiliar with OpenCog can grasp the essence of this graph:

You can actually play with our sample graph above online here :

Again, for this sample the ListLink is collapsed into distinct SUBJECT and OBJECT relationships, which is easier to model and more intuitive. It's possible to model ListLink in Neo4j using relationship properties, but it doesn't look as pretty for this example. :) It is usable, though, for other cases where an actual ordered list is required.
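
For the ordered-list case, the idea of "relationship properties" can be sketched in plain Python as follows. The ELEMENT relationship type, the position property and the ordered_members helper are hypothetical names chosen for this illustration. In Neo4j the same idea would be a property on each relationship from the list node to its members.

```python
# Each edge is (source, relationship_type, target, properties).
# Edges are stored in arbitrary order; the 'position' property carries the order.
edges = [
    ("list1", "ELEMENT", "mouse", {"position": 1}),
    ("list1", "ELEMENT", "cat", {"position": 0}),
]


def ordered_members(edges, list_id):
    """Recover the ordered members of a list node from its outgoing edges."""
    members = [(props["position"], target)
               for source, rel, target, props in edges
               if source == list_id and rel == "ELEMENT"]
    return [target for _, target in sorted(members)]


print(ordered_members(edges, "list1"))
```

The order lives on the edges rather than in the storage order, so it has to be sorted at query time. That is why this is less pretty than named SUBJECT and OBJECT relationships when there are only two members.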

For the above example, with PostgreSQL I'd create predicatenode and conceptnode tables, and find a way to structure evaluationlink taking into consideration the polymorphism and dynamic nature of AtomSpace (assuming we want to use foreign keys & joins). It's not as straightforward as Neo4j, but PostgreSQL has features like hstore that make it flexible. Also, with full control of indexes, views, etc., it should be possible to tweak its performance for large datasets.
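
A minimal sketch of such a schema follows. To keep it runnable without a database server it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the column names are my own assumptions. It also deliberately sidesteps the polymorphism problem by requiring subject and object to be ConceptNodes, which a real AtomSpace mapping could not assume.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE predicatenode (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE conceptnode   (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE evaluationlink (
    id           INTEGER PRIMARY KEY,
    predicate_id INTEGER NOT NULL REFERENCES predicatenode(id),
    subject_id   INTEGER NOT NULL REFERENCES conceptnode(id),
    object_id    INTEGER NOT NULL REFERENCES conceptnode(id),
    strength     REAL NOT NULL,
    confidence   REAL NOT NULL
);
""")
conn.execute("INSERT INTO predicatenode (id, name) VALUES (1, 'eat')")
conn.executemany("INSERT INTO conceptnode (id, name) VALUES (?, ?)",
                 [(1, "cat"), (2, "mouse")])
conn.execute("INSERT INTO evaluationlink VALUES (1, 1, 1, 2, 0.9, 0.2)")

# Three joins are needed just to read one evaluation back with its names.
rows = conn.execute("""
    SELECT p.name, s.name, o.name, e.strength, e.confidence
    FROM evaluationlink e
    JOIN predicatenode p ON p.id = e.predicate_id
    JOIN conceptnode s ON s.id = e.subject_id
    JOIN conceptnode o ON o.id = e.object_id
""").fetchall()
print(rows)
```

Compared with the single Cypher pattern in the Neo4j version, the joins are more verbose, but the foreign keys give integrity checks for free.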