Senin, 07 Juli 2014

Adding Distributed Indexes to Hypergraph Database for Horizontal Scaling of Semantic Reasoning

While discussing distributed AtomSpace architecture in OpenCog group, Dr. Linas Vepstas noted:
Reference resolution, reasoning and induction might be fairly local as well:  when reading and trying to understand a wikipedia article, it seems as if its related to a million different things.   A single CPU with 16GB RAM can hold 100 million atoms in RAM, requiring no disk or network access.   

The only reason for a database then becomes as a place to store, over long periods of time, the results of the computation.  Its quite possible that fast performance of the database won't actually be important.  Which would mean that the actual database architecture might not be very important.  Maybe.
Based on the experiments, while processing (i.e. reasoning) 200,000 "atoms" in 3 seconds on a single host isn't too bad, searching for a few atoms out of 200,000 (or even 1 billion) on single host should take very fast (i.e. ~ 1 ms or less).

So I guess these are two distinct tasks. Searching would use (distributed) indexing, while processing/reasoning can be done by MindAgents combining data-to-compute and compute-to-data, with consideration to data affinity.

For processing which requires non-local data that Dr. Vepstas concerned, when using compute+data grid such as GridGain, a compute grid is automatically a cache, so all required non-local data are automatically cached. Which may or may not be sufficient, depending on the algorithm.

For searches, it seems we need to create separate indexes for each purpose, each index is sharded/partitioned appropriately to distribute compute load. Which means AtomSpace data grid is will have redundancy in many ways. The AtomSpace can probably be "split" into 3 parts:
  1. the hypergraph part (can be stored in HyperGraphDB or Neo4j)
  2. the eager index parts, always generated for the entire hypergraph, required for searches (can be stored in Cassandra or Solr or ElasticSearch)
  3. the lazy index parts, the entries are calculated on demand then stored for later usage (can be stored in Cassandra or Solr or ElasticSearch)
The hypergraph would be good when you already know the handles, and for traversing. But when the task is "which handles A are B of the handles C assuming D is E?" an index is needed to answer this (particular task) quickly. Hopefully ~1 ms for each grid node, so 100 nodes working in parallel, will generate 100 set of answers in the ~1 ms.

Today, a 16 GB RAM node with 2 TB SATA storage is probably typical config (SSD will also work, but just for the sake of thought experiment a spinning disk more performance concerns). The node holds a partition of the distributed AtomSpace, and is expected to answer any search (i.e. give me handles of atoms in your node where it matches criteria X, Y, Z) within 1ms, and can do processing over a select nodes (i.e. for handles [A, B, C, ... N] perform this closure) within 1 second.

To achieve these goals:
  1. For quick searches for that partition, all atom data needs to be indexed in multiple ways, an index for each purpose
  2. For quick updates to the index (triggered by updates to data), the index and data are colocated in the same host to avoid network IO, although can be in different stores (i.e. data in HyperGraphDB and index in Cassandra). The partitioning/sharding need to accomodate this. So for 2 TB storage, we can put perhaps 100 GB data and 1 TB of indexes.
  3. For quick lookup and updates of subset of data, the RAM is used as read-through & write-through cache by the data grid.
  4. For non-local search/update/lookup/processing, it uses the data grid to do so, and caches results locally in RAM, that can overflow to disk. We still have 900 GB of space left, so we can use it for this purpose.
  5. For quick processing of subset of data, local lookups are performed (which should take near-constant time, even with drives) and much faster if requested data is already in cache. Processing is then done using CPU or GPGPU (via OpenCL, e.g. Encog neural network library uses OpenCL to accelerate calculations). Results are then sent back via network.
For question-answering, given the label (e.g. Ibnu Sina) and possible concept types (Person), and optionally discussion contexts (Islam, religion, social, medicine), find the ConceptNode's which has that label, that type, and the confidence value for each contexts. And I want it done in 1 ms. :D

YAGO has 15,372,313 labels (1.1 GB dataset) for 10+ million entities. The entire YAGO is 22 GB. Assuming the entities with labels are stored in AtomSpace, selecting the matching labels without index would take ~150 seconds on a single host and ~50 seconds on 3 nodes (extrapolating my previous results). With indexes this should be 1ms.

First index would give the concepts given a label and types, with structure like :

label -> type -> [concept, concept, concept, ...]
            type -> [concept, concept, concept, ...]
            type -> [concept, concept, concept, ...]

Second index would give the confidence, given a concept and contexts, with sample data like :

Ibnu_Sina1 -> { Islam: 0.7, medicine: 0.9, social: 0.3, ... }
Ibnu_Sina2 -> { Islam: 0.1, medicine: 0.3, social: 0.9, ... }

Indexes change constantly, for each atom change multiple indexes must be updated, and index updates would take more resources than updating the atoms themselves, so index updates are asynchronous and eventually consistent. (I guess this also happens on humans, when humans learn new information, they don't immediately "understand" it. I mean, we now know a new fact, but it takes time [or even sleep] to make sense or implications/correlations of that new fact.)

We should agree on a set of a priori indexes. (As new concepts are learned and OpenCog gets queries that take a long time processing too many atoms, the AI may learn to make new indexes or tune existing ones... although this is probably too meta and distant future. :D )