Thursday, July 3, 2014

Integration with YAGO2s as Semantic Knowledge Base

I recently had a conversation with Professor Fabian M. Suchanek, Associate Professor at Télécom ParisTech in Paris and creator of the YAGO2s semantic knowledge base. I'm truly thankful that Professor Suchanek has made his hard and persistent work available to the public as open structured data.

I'm grateful he helped me explore some concerns about my thesis. I need to put them down here for my own reference as well, and hopefully they'll be useful to you too. :-)

I'm studying for a master's degree at the Bandung Institute of Technology, Indonesia, and my thesis is a knowledge base for Lumen Robot Friend. Lumen is originally based on NAO, but for the purposes of my thesis I expect it to be more server-side, with clients for NAO, Android, and Windows. The server is also connected to Facebook, Twitter, and Google Hangouts for text/chat interaction.

The primary means of interaction will be natural language (text and speech for English, text only for Indonesian) via a device (such as NAO) and the social media integrations.

Inputs will be parsed into semantic atoms, which will then be asserted into memory (or a persistence database) for reasoning. The results in turn serve as parameters for a fitness-scoring algorithm that selects appropriate responses/actions (if so desired).
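To make the selection step concrete, here's a minimal sketch of what a fitness-scoring pass over candidate actions could look like. All names here (`fitness`, `select_action`, the atom tuples) are purely illustrative assumptions, not the thesis implementation:

```python
# Hypothetical sketch: each candidate action is scored against the set of
# asserted semantic atoms in memory, and the best-scoring one is selected.

def fitness(action, memory):
    """Score an action by the fraction of its preconditions asserted in memory."""
    satisfied = sum(1 for atom in action["preconditions"] if atom in memory)
    return satisfied / max(len(action["preconditions"]), 1)

def select_action(candidates, memory):
    """Pick the candidate action with the highest fitness score."""
    return max(candidates, key=lambda a: fitness(a, memory))

# Toy memory of asserted atoms, as (predicate, args...) tuples.
memory = {("isLocatedIn", "Paris", "France"), ("greeted", "user")}
candidates = [
    {"name": "answer_location",
     "preconditions": [("isLocatedIn", "Paris", "France")]},
    {"name": "ask_clarification",
     "preconditions": [("unknown", "topic")]},
]
print(select_action(candidates, memory)["name"])  # answer_location
```

A real scorer would of course weigh atoms differently rather than just counting satisfied preconditions, but the shape is the same: memory in, ranked actions out.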

YAGO2s will be used as the knowledge base, so Lumen will know the answer to straightforward questions like "where is Paris?" and hopefully to somewhat more complex questions like "which cheeses are made in France?", complex not in terms of the database query, but in terms of mapping from natural language to a semantic query and back to natural language generation. (BTW, I *love* cheese! Thank you, cheese inventors :) )
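Conceptually, the cheese question is just an intersection over subject-predicate-object triples. Here's a tiny sketch of that idea over an in-memory triple set; the relation names (`rdf:type`, `madeIn`) and entities are illustrative assumptions, not necessarily the actual YAGO2s vocabulary:

```python
# Toy triple store, each triple a (subject, predicate, object) tuple.
# NOTE: "madeIn" is a made-up relation name for illustration only.
triples = {
    ("Camembert", "rdf:type", "Cheese"),
    ("Camembert", "madeIn", "France"),
    ("Cheddar", "rdf:type", "Cheese"),
    ("Cheddar", "madeIn", "England"),
}

def query(triples, predicate, obj):
    """Return all subjects s such that (s, predicate, obj) is asserted."""
    return {s for (s, p, o) in triples if p == predicate and o == obj}

# "Which cheeses are made in France?" = cheeses ∩ things-made-in-France.
cheeses = query(triples, "rdf:type", "Cheese")
french = query(triples, "madeIn", "France")
print(cheeses & french)  # {'Camembert'}
```

The hard part, as noted above, isn't this query; it's getting from the English sentence to it and back.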

Other than hard facts, Lumen can learn simple intelligence for socialization (I'm targeting a prekindergarten-like semantic and reasoning ability) via daily interaction and a specialized training UI: things like who its friends are, what foods its friends like, etc.

I hope not to be too ambitious with my thesis, so I'm exploring boundaries to assess which objectives would be attainable (and which to leave out of scope) during the thesis, which will take the next 18 months. :) Although it sounds like it's going to involve tens of GBs of data (YAGO2s is already 22 GB before indexing.. and I still have copies of WordNet, etc. lying all around my hard drive, hehe..), the scope of my thesis will be much more limited than that.

My plan is to devise a set of rule languages (DSLs), one of which would allow pattern matching of subhypergraphs (as OpenCog's AtomSpace puts it) to a YAGO property. It's just like a switch statement, but for graphs instead of literals. In practice the current prototype works "alright" with just subtrees, so I hope I won't actually need to match full subgraphs :)
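The "switch statement for trees" idea can be sketched as a small recursive matcher: patterns are nested tuples, strings starting with `$` are variables, and a successful match returns the bindings. This is my own illustrative sketch, not the thesis DSL:

```python
# Hypothetical subtree pattern matcher. A pattern is a nested tuple;
# strings beginning with '$' bind to whatever subtree they align with.

def match(pattern, tree, bindings=None):
    """Match pattern against tree; return a bindings dict, or None on failure."""
    if bindings is None:
        bindings = {}
    if isinstance(pattern, str) and pattern.startswith("$"):
        bindings[pattern] = tree  # variable: bind and succeed
        return bindings
    if isinstance(pattern, tuple) and isinstance(tree, tuple):
        if len(pattern) != len(tree):
            return None
        for p, t in zip(pattern, tree):
            if match(p, t, bindings) is None:
                return None
        return bindings
    return bindings if pattern == tree else None  # literal: must be equal

print(match(("QuestionLink", "where", "$place"),
            ("QuestionLink", "where", "Paris")))
# {'$place': 'Paris'}
```

Extending this from trees to general subgraphs is exactly the step I hope to avoid.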

So "where is Paris?" / "Paris ada di mana?" (the same question in Indonesian) would be turned by the NLP parser into a semantic representation like:

(QuestionLink where Paris)

against which we can write a rule, e.g.:

  (QuestionLink where $place)
  select o as $loc
    where s = $place and p = 'isLocatedIn'

  assert (EvaluationLink (isLocatedIn $place $loc))

which will assert the statement:

  (isLocatedIn Paris
    [ "Europe", "Île-de-France_(region)" ])

from which the NLP generation will produce:

Paris is located in Europe and Île de France (region).

...which is still correct, although a human would probably answer it differently. :)
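The whole rule above, from matched question to generated sentence, could be sketched end-to-end like this. The function name, the toy triple store, and the string cleanup are all illustrative assumptions, not the actual DSL or NLG component:

```python
# Toy triples corresponding to the YAGO facts used in the example above.
triples = [
    ("Paris", "isLocatedIn", "Europe"),
    ("Paris", "isLocatedIn", "Île-de-France_(region)"),
]

def answer_where(question, triples):
    """Handle (QuestionLink, where, X): find every o with (X, isLocatedIn, o),
    then generate an English answer sentence from the results."""
    kind, word, place = question
    if kind != "QuestionLink" or word != "where":
        return None  # rule does not apply
    locs = [o for (s, p, o) in triples if s == place and p == "isLocatedIn"]
    # Naive NLG: strip underscores and parentheses from YAGO identifiers.
    readable = [loc.replace("_", " ").replace("(", "").replace(")", "")
                for loc in locs]
    return f"{place} is located in {' and '.join(readable)}."

print(answer_where(("QuestionLink", "where", "Paris"), triples))
# Paris is located in Europe and Île-de-France region.
```

In the thesis this would be three separate rule stages (match, KB query, generation) expressed in the DSL rather than hard-coded in one function.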

The goal is to develop this DSL so that it can express mappings from semantic queries into YAGO KB queries and back into natural-language answers. I'll write only a few rules, as proof and example, without aiming for broad coverage. More rules can then be written to increase coverage, but that is outside the scope of the thesis.

I also plan to limit the scope of NLP processing and word sense disambiguation (WSD), so I'll just use Link Grammar and OpenCog's RelEx; parseable sentences are therefore also scoped by these libraries' capabilities.

I also need to state my scope better, so as not to inflate the expectations of my thesis advisor, Dr.techn. Ary Setijadi Prihatmanto. :)