
Showing posts with label Semantic Reasoning. Show all posts

Wednesday, January 14, 2015

Research Papers on the YAGO Semantic Database

Here's the link to the papers on the YAGO semantic knowledge base: http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/publications/

In my opinion, the strengths of the YAGO ontology are:
1. Multilingual
2. Supports spatial data and geolocation
3. Supports temporal information
4. Supports multiple and date-range-based relations, e.g. A divorces B and then remarries C, or D is in a polygamous marriage with E and F complete with time ranges; all of this can be modeled. Not sure yet whether there's an ontology for affairs, though, haha :p
5. Open source (the entire ~22 GB dataset can be downloaded)
6. Proven on a very large dataset (the entire Wikipedia)
7. Its main researcher, Dr. Fabian Suchanek, Associate Professor at Télécom ParisTech and the Max Planck Institute, is very supportive. He's delighted when someone does research around YAGO, and there's potential for collaboration too.
8. Its data model (RDF/TTL) has a standard JSON-LD representation that is easy to transfer over a network/messaging, can be processed by a browser/JavaScript, and is human readable.
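As an illustration of point 8, a single YAGO-style fact could be represented in JSON-LD roughly like this (the URIs below are illustrative, not copied from the actual dataset; `isLocatedIn` is a real YAGO property):

```json
{
  "@context": {
    "isLocatedIn": { "@id": "http://yago-knowledge.org/resource/isLocatedIn", "@type": "@id" }
  },
  "@id": "http://yago-knowledge.org/resource/Paris",
  "isLocatedIn": "http://yago-knowledge.org/resource/France"
}
```

This is plain JSON, so it can be sent over any messaging channel and parsed directly by JavaScript in the browser.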

According to the news, MPI says YAGO3 is coming soon, which should be a refinement of the YAGO2s I tinkered with recently. :)

Monday, July 14, 2014

Probabilistic Programming with Church

Church is a probabilistic programming language with a syntax resembling Scheme/Common Lisp, and it has a JavaScript implementation, Webchurch, that runs in a web browser.

With Church, sampling probabilistic data becomes easier, as does representing the characteristics of a probability distribution as a procedure.

Here's an example Webchurch program I wrote to sample two weighted variables:

(define a (lambda () (flip 0.8)))                 ; A is true with probability 0.8
(define b (lambda (a) (if a (flip) (flip 0.3))))  ; B: fair flip if A is true, else 0.3
(define ab (lambda ()
             (define A (a))
             (list A (b A))))                     ; one joint sample (A, B)
(hist (repeat 1000 ab) "P(A, B)")                 ; histogram of 1000 joint samples

The result:

[histogram of the 1000 (A, B) samples]

Yay :)

It's interesting... will this probabilistic programming language be useful for implementing semantic reasoning or PLN? I don't know yet, to be honest...
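For comparison, here is a rough Python equivalent of the Webchurch program above (a sketch using the standard library; the function names mirror the Church code):

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the run is reproducible

def a():
    # flip 0.8: True with probability 0.8
    return random.random() < 0.8

def b(a_val):
    # B depends on A: fair flip if A is true, else probability 0.3
    p = 0.5 if a_val else 0.3
    return random.random() < p

def ab():
    # one joint sample (A, B)
    A = a()
    return (A, b(A))

# histogram of 1000 joint samples, like (hist (repeat 1000 ab) ...)
samples = Counter(ab() for _ in range(1000))
print(samples)
```

The `Counter` plays the role of Webchurch's `hist`: counts for the four (A, B) outcomes, with A true in roughly 80% of samples.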

Monday, July 7, 2014

Distributing and Parallelizing Probabilistic Logic Networks Reasoning

During a discussion about making Probabilistic Logic Networks (PLN) support parallel reasoning using a distributed AtomSpace, Dr. Ben Goertzel noted:
As for parallelism, I believe that logic chaining can straightforwardly be made parallel, though this involves changes to the algorithms and their behavior as heuristics.   For example, suppose one is backward chaining and wishes to apply deduction to obtain A --> C.   One can then evaluate multiple B potentially serving the role here, e.g.

A --> B1, B1 --> C  |-  A --> C
A --> B2, B2 --> C  |-  A --> C
...

Potentially, each Bi could be explored in parallel, right? Also, in exploring each of these, the two terms could be backward chained on in parallel, so that e.g.

A --> B1

and

B1 --> C

could be explored in parallel...

In this way the degree of parallelism exploited by the backward chainer would expand exponentially during the course of a single chaining exploration, until reaching the natural limit imposed by the infrastructure.

This will yield behavior that is conceptually similar, though not identical, to serial backward chaining.
I haven't learned much about PLN yet, but I hope it can be made to work by distributing computation across AtomSpace nodes.

Other than strictly parallel execution, each path can be assigned a priority or heuristic, turning the tasks into a distributed priority queue. A task that finishes earlier can then insert more tasks into the priority queue, and these new tasks don't have to go at the very end; they can land in the middle, etc.
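The priority-queue exploration idea can be sketched in a few lines of Python (this is my own toy sketch, not PLN's actual chainer; `explore` and `expand` are hypothetical names):

```python
import heapq

def explore(initial_tasks, expand, budget):
    """Explore paths ordered by heuristic priority.

    initial_tasks: list of (priority, path); lower priority value = explored first.
    expand(path) -> list of (priority, new_path) follow-up tasks.
    A finished task pushes its follow-ups anywhere into the queue, not
    just at the end -- this is the heuristic ordering described above.
    """
    queue = list(initial_tasks)
    heapq.heapify(queue)
    explored = []
    while queue and len(explored) < budget:
        prio, path = heapq.heappop(queue)
        explored.append(path)
        for child in expand(path):
            heapq.heappush(queue, child)
    return explored
```

In a distributed setting the heap would be replaced by a distributed queue, but the behavior is the same: a promising intermediate term B1 (low priority value) jumps ahead of less promising ones regardless of insertion order.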


If we'd like to explore 1000 paths, and each of those generates another 1000 paths, ideally we'd have 1 million nodes. The first phase would be executed by 1000 nodes in parallel, the next 1 million paths by 1 million nodes in parallel, and we'd have a result in 2 ms. :) In practice we can probably only afford a few nodes and limited time; can PLN use heuristic discovery?

For example, if the AI participates in "Are You Smarter than a 5th Grader?", the discovery paths would be different from "calculate the best company strategy; I'll give you two weeks and a detailed report". In a quiz, the AI would need to come up with a vague answer quickly, then refine the answer progressively until time runs out. I.e., when requested 10 outputs, the quiz one will try to get 10 answers as soon as possible even if many of them are incorrect, while the business one will try to get 1 answer correct, even if it means the other 9 are left unanswered.

Does PLN do this? If so, the distributed AtomSpace architecture would evolve hand-in-hand with (distributed) PLN. An app or module shouldn't be required to be distributed to use AtomSpace; however, a module (like PLN) that's aware that AtomSpace is both a distributed data grid and a distributed compute grid can take advantage of this architecture and make its operations much faster and more scalable. It's akin to the difference between rendering 3D scenes on the CPU vs. using OpenGL-accelerated graphics. However, a computer usually has only one or two graphics cards, fixed in number, whereas an AtomSpace cluster can have a dynamic number of nodes and you can throw more at it at any time; i.e. for an expensive computation you can launch 100 EC2 instances for several hours, then shut them down when done.

Adding Distributed Indexes to Hypergraph Database for Horizontal Scaling of Semantic Reasoning

While discussing the distributed AtomSpace architecture in the OpenCog group, Dr. Linas Vepstas noted:
Reference resolution, reasoning and induction might be fairly local as well: when reading and trying to understand a Wikipedia article, it seems as if it's related to a million different things. A single CPU with 16 GB RAM can hold 100 million atoms in RAM, requiring no disk or network access.

The only reason for a database then becomes as a place to store, over long periods of time, the results of the computation. It's quite possible that fast performance of the database won't actually be important. Which would mean that the actual database architecture might not be very important. Maybe.
Based on my experiments, while processing (i.e. reasoning over) 200,000 atoms in 3 seconds on a single host isn't too bad, searching for a few atoms out of 200,000 (or even 1 billion) on a single host should be very fast (i.e. ~1 ms or less).

So I guess these are two distinct tasks. Searching would use (distributed) indexing, while processing/reasoning can be done by MindAgents combining data-to-compute and compute-to-data, with consideration for data affinity.

As for processing that requires non-local data, the case that concerned Dr. Vepstas: when using a compute+data grid such as GridGain, a compute grid is automatically a cache, so all required non-local data is cached automatically. This may or may not be sufficient, depending on the algorithm.


For searches, it seems we need to create separate indexes for each purpose, each index sharded/partitioned appropriately to distribute compute load. This means the AtomSpace data grid will have redundancy in many ways. The AtomSpace can probably be "split" into 3 parts:
  1. the hypergraph part (can be stored in HyperGraphDB or Neo4j)
  2. the eager index parts, always generated for the entire hypergraph, required for searches (can be stored in Cassandra or Solr or ElasticSearch)
  3. the lazy index parts, the entries are calculated on demand then stored for later usage (can be stored in Cassandra or Solr or ElasticSearch)
The hypergraph would be good when you already know the handles, and for traversing. But when the task is "which handles A are B of the handles C, assuming D is E?", an index is needed to answer this (particular task) quickly. Hopefully ~1 ms for each grid node, so 100 nodes working in parallel will generate 100 sets of answers in ~1 ms.

Today, a node with 16 GB RAM and 2 TB SATA storage is probably a typical config (an SSD would also work, but for the sake of the thought experiment a spinning disk raises more performance concerns). The node holds a partition of the distributed AtomSpace, and is expected to answer any search (i.e. "give me handles of atoms in your node that match criteria X, Y, Z") within 1 ms, and to do processing over selected atoms (i.e. "for handles [A, B, C, ... N] perform this closure") within 1 second.

To achieve these goals:
  1. For quick searches for that partition, all atom data needs to be indexed in multiple ways, an index for each purpose
  2. For quick updates to the index (triggered by updates to data), the index and data are colocated on the same host to avoid network IO, although they can be in different stores (i.e. data in HyperGraphDB and index in Cassandra). The partitioning/sharding needs to accommodate this. So for 2 TB storage, we can put perhaps 100 GB of data and 1 TB of indexes.
  3. For quick lookup and updates of subset of data, the RAM is used as read-through & write-through cache by the data grid.
  4. For non-local search/update/lookup/processing, it uses the data grid to do so, and caches results locally in RAM, which can overflow to disk. We still have 900 GB of space left, so we can use it for this purpose.
  5. For quick processing of subset of data, local lookups are performed (which should take near-constant time, even with drives) and much faster if requested data is already in cache. Processing is then done using CPU or GPGPU (via OpenCL, e.g. Encog neural network library uses OpenCL to accelerate calculations). Results are then sent back via network.
For question answering: given a label (e.g. Ibnu Sina) and possible concept types (Person), and optionally discussion contexts (Islam, religion, social, medicine), find the ConceptNodes which have that label and that type, and the confidence value for each context. And I want it done in 1 ms. :D

YAGO has 15,372,313 labels (a 1.1 GB dataset) for 10+ million entities. The entire YAGO is 22 GB. Assuming the entities with labels are stored in AtomSpace, selecting the matching labels without an index would take ~150 seconds on a single host and ~50 seconds on 3 nodes (extrapolating my previous results). With indexes this should be ~1 ms.

The first index would return the concepts given a label and types, with a structure like:

label -> type -> [concept, concept, concept, ...]
         type -> [concept, concept, concept, ...]
         type -> [concept, concept, concept, ...]

The second index would return the confidence given a concept and contexts, with sample data like:

Ibnu_Sina1 -> { Islam: 0.7, medicine: 0.9, social: 0.3, ... }
Ibnu_Sina2 -> { Islam: 0.1, medicine: 0.3, social: 0.9, ... }
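As a sketch, the two indexes above can be modeled as plain nested dictionaries (a toy in-memory stand-in for Cassandra/Solr/ElasticSearch; the `resolve` function and its ranking rule are my own illustration):

```python
# First index: label -> type -> list of concept handles
label_index = {
    "Ibnu Sina": {
        "Person": ["Ibnu_Sina1", "Ibnu_Sina2"],
    },
}

# Second index: concept handle -> context -> confidence
confidence_index = {
    "Ibnu_Sina1": {"Islam": 0.7, "medicine": 0.9, "social": 0.3},
    "Ibnu_Sina2": {"Islam": 0.1, "medicine": 0.3, "social": 0.9},
}

def resolve(label, ctype, contexts):
    """Return (concept, score) pairs, ranked by summed context confidence."""
    candidates = label_index.get(label, {}).get(ctype, [])
    scored = [
        (c, sum(confidence_index.get(c, {}).get(ctx, 0.0) for ctx in contexts))
        for c in candidates
    ]
    return sorted(scored, key=lambda cs: cs[1], reverse=True)
```

Both lookups are O(1) hash accesses plus a small scan over the candidates, which is what makes the ~1 ms target plausible per partition.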

Indexes change constantly: for each atom change, multiple indexes must be updated, and index updates take more resources than updating the atoms themselves, so index updates are asynchronous and eventually consistent. (I guess this also happens with humans: when we learn new information, we don't immediately "understand" it. I mean, we now know a new fact, but it takes time [or even sleep] to make sense of that fact's implications/correlations.)

We should agree on a set of a priori indexes. (As new concepts are learned and OpenCog gets queries that take a long time processing too many atoms, the AI may learn to create new indexes or tune existing ones... although that is probably too meta and far in the future. :D )

Thursday, July 3, 2014

The YAGO2s Knowledge Base for Testing Robot Knowledge

The YAGO2s semantic knowledge base will be used as the source of facts for the Lumen Knowledge Base.


To keep application development on track and its evaluation measurable, we need to create test data.

For later FitNesse acceptance testing, here are some examples of the generated test data: question-answer pairs in two languages, English and Indonesian. These will test capabilities in language detection, natural language parsing, natural language generation, localization, and semantic queries for direct facts (no inference, no reasoning).

English / Bahasa Indonesia
What is the airport code of Freeman Municipal Airport?
  Airport code of Freeman Municipal Airport is 'SER'.
Apa kode bandara Freeman Municipal Airport?
  Kode bandara Freeman Municipal Airport adalah 'SER'.
What is Huayna Picchu's latitude?
  Huayna Picchu's latitude is -13.158°.
Berapa lintang Huayna Picchu?
  Lintang Huayna Picchu adalah -13,158°.
When was Vampire Lovers destroyed?
  Vampire Lovers was destroyed on year 1990.
Kapan Vampire Lovers dihancurkan?
  Vampire Lovers dihancurkan pada tahun 1990.
What is the gini index of Republica De Nicaragua?
  Gini index of Republica De Nicaragua is 52.3%.
Berapa indeks gini Republica De Nicaragua?
  Indeks gini Republica De Nicaragua adalah 52,3%.
What movies did Anand Milind write the music for?
  Anand Milind wrote music for Jeevan Ki Shatranj.
Anand Milind menciptakan lagu untuk film apa?
  Anand Milind menciptakan lagu untuk Jeevan Ki Shatranj.
How many people live in Denton, Montana?
  Denton, Montana's population is 301 people.
Berapa populasi Denton, Montana?
  Populasi Denton, Montana adalah 301 orang.
How much is the GDP of Беларусь?
  GDP of Беларусь is $55,483,000,000.00.
Berapa PDB Беларусь?
  PDB Беларусь adalah USD55.483.000.000,00.
Where is House of Flora's website?
  House of Flora's website is at http://houseofflora.bigcartel.com/products.
Di mana alamat website House of Flora?
  Alamat website House of Flora ada di http://houseofflora.bigcartel.com/products.
How much is the revenue of Scientific-Atlanta?
  Revenue of Scientific-Atlanta is $1,900,000,000.00.
Berapa pendapatan Scientific-Atlanta?
  Pendapatan Scientific-Atlanta adalah USD1.900.000.000,00.
Who are the children of Rodney S. Webb?
  Children of Rodney S. Webb are Todd Webb.
Siapa saja anak Rodney S. Webb?
  Anak Rodney S. Webb adalah Todd Webb.
What is the currency of Kyrgzstan?
  Currency of Kyrgzstan is Kyrgyzstani som.
Apa mata uang Kyrgzstan?
  Mata uang Kyrgzstan adalah Kyrgyzstani som.
Where did Siege of Candia happen?
  Siege of Candia happened in Ηράκλειο.
Di mana Siege of Candia terjadi?
  Siege of Candia terjadi di Ηράκλειο.
What is the citizenship of Amanda Mynhardt?
  Amanda Mynhardt is a citizen of Republic of South Africa.
Amanda Mynhardt warganegara mana?
  Amanda Mynhardt adalah warganegara Republic of South Africa.
When was Stelios born?
  Stelios was born on Tuesday, November 15, 1977.
Kapan Stelios dilahirkan?
  Stelios lahir pada Selasa 15 November 1977.
Where did Henry Hallett Dale die?
  Henry Hallett Dale died in Grantabridge.
Di mana Henry Hallett Dale meninggal dunia?
  Henry Hallett Dale meninggal dunia di Grantabridge.
Who did Diefenbaker marry?
  Diefenbaker is married to John Diefenbaker.
Siapa pasangan Diefenbaker?
  Diefenbaker menikahi John Diefenbaker.
How tall is Calpine Center?
  Calpine Center's height 138.074 m.
Berapa tinggi Calpine Center?
  Tinggi Calpine Center adalah 138,074 m.
What does Pearlette lead?
  Pearlette is a leader of St.lucia.
Pearlette memimpin apa?
  Pearlette adalah pemimpin St.lucia.
Where does Ty Tryon live?
  Ty Tryon lives in Orlando, Fla..
Ty Tryon tinggal di mana?
  Ty Tryon tinggal di Orlando, Fla..
What movies did Markowitz direct?
  Markowitz directed Murder in the Heartland.
Markowitz menyutradarai film apa?
  Markowitz menyutradarai Murder in the Heartland.
What did Thalía create?
  Thalía created I Want You/Me Pones Sexy.
Apa yang dibuat Thalía?
  Thalía membuat I Want You/Me Pones Sexy.
How much is Pōtītī's inflation?
  Pōtītī's inflation is 1.1 %.
Berapa inflasi Pōtītī?
  Inflasi Pōtītī adalah 1,1 %.
What is the capital city of Kingdom of Bavaria?
  Capital city of Kingdom of Bavaria is Minga.
Apa ibu kota Kingdom of Bavaria?
  Ibu kota Kingdom of Bavaria adalah Minga.
How much does Mária Mohácsik weight?
  Mária Mohácsik weights 70,000 g.
Berapa berat Mária Mohácsik?
  Berat Mária Mohácsik adalah 70.000 g.
What is the language code of Gujarati (India)?
  The language code of Gujarati (India) is 'gu'.
Apa kode bahasa dari Gujarati (India)?
  Kode bahasa dari Gujarati (India) adalah 'gu'.
What movies star Raaj Kumar?
  Raaj Kumar acted in Pakeezah.
Film apa saja yang dibintangi Raaj Kumar?
  Raaj Kumar membintangi Pakeezah.

Once the test data is ready, the next step is of course to get the application to pass all of the tests above. :-) Amen.

Integration with YAGO2s as Semantic Knowledge Base

I recently had a conversation with Professor Fabian M. Suchanek, Associate Professor at Télécom ParisTech in Paris and creator of the YAGO2s semantic knowledge base. I'm truly thankful that Professor Suchanek made his hard and persistent work available to the public as open structured data.


I'm grateful he helped me explore concerns about my thesis. I need to put them down here for my own reference, and hopefully they're useful to you too. :-)

I'm doing my master's at Bandung Institute of Technology, Indonesia, and my thesis is a knowledge base for Lumen Robot Friend. Lumen is originally based on NAO, but for the purposes of my thesis I expect it to be more server-side, with clients for NAO, Android, and Windows. The server is also connected to Facebook, Twitter, and Google Hangouts for text/chat interaction.

The primary means of interaction would be natural language (text and speech for English, text only for Indonesian) via a device (such as NAO) and the social media integrations.

Inputs will be parsed into semantic atoms, which will then be asserted into memory (or a persistence database) for reasoning, which in turn will be used as parameters for a fitness scoring algorithm in order to select appropriate responses/actions (if so desired).

YAGO2s will be used as the knowledge base so Lumen will know the answer to straightforward questions like "where is Paris?", and hopefully slightly more complex questions (not in terms of the database query, but in terms of mapping from natural language to a semantic query and back to natural language generation) like "which cheeses are made in France?" (BTW, I *love* cheese! Thank you, cheese inventors :) )

Other than hard facts, Lumen can learn simple social intelligence (I'm targeting prekindergarten-like semantic and reasoning ability) via daily interaction and a specialized training UI. Things like who its friends are, what food its friends like, etc.

I hope not to be too ambitious with my thesis, so I'm exploring the boundaries to assess which objectives would be attainable (and which to leave out of scope) during my thesis (the next 18 months). :) Although it sounds like it's gonna involve tens of GBs of data (YAGO2s is already 22 GB before indexing... and I still have copies of WordNet etc. lying all around my hard drive, hehe..), the scope of my thesis will be much more limited than that.

My plan is to devise a set of rule languages (DSLs) which, for one, would allow pattern matching of subhypergraphs (as OpenCog's AtomSpace puts it) to a YAGO property. Just like a switch statement, but for graphs instead of literals. In practice the current prototype works "alright" with just subtrees, so I hope I won't actually have to match subgraphs :)

So "where is Paris?" / "Paris ada di mana?" would be turned by NLP parser into semantic representation like:

(QuestionLink where Paris)

for which we can write a rule, e.g.

when
  (QuestionLink where $place)
then
  select o as $loc
    where s = $place and p = 'isLocatedIn'

  assert (EvaluationLink (isLocatedIn $place $loc))

which will assert the statement:

(EvaluationLink
  (isLocatedIn Paris
    [ "Europe", "Île-de-France_(region)" ]))


from which the NLP generation will produce:

Paris is located in Europe and Île de France (region).

...which is still correct, although a human would probably answer it differently. :)

The goal is to develop this DSL such that it's possible to express mappings from semantic queries into YAGO KB queries and into natural language answers. I'll write only a few rules, as proof and example, not aiming to be expansive. More rules can then be written to increase coverage, which is outside the scope of the thesis.
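The behavior of the "where" rule above can be sketched in Python over a toy triple store (this is my own illustration of the rule's semantics, not the actual DSL engine; the tuple shapes are invented):

```python
# Toy subject-predicate-object store standing in for the YAGO KB
triples = [
    ("Paris", "isLocatedIn", "Europe"),
    ("Paris", "isLocatedIn", "Île-de-France_(region)"),
]

def answer_where(question):
    """Handle a parsed (QuestionLink where $place) by querying isLocatedIn.

    Returns an EvaluationLink-like tuple, or None if the rule doesn't match.
    """
    kind, word, place = question
    if kind != "QuestionLink" or word != "where":
        return None  # rule's `when` clause doesn't match
    # The rule's `select o as $loc where s = $place and p = 'isLocatedIn'`
    locs = [o for (s, p, o) in triples if s == place and p == "isLocatedIn"]
    # The rule's `assert (EvaluationLink (isLocatedIn $place $loc))`
    return ("EvaluationLink", ("isLocatedIn", place, locs))
```

Feeding it `("QuestionLink", "where", "Paris")` yields the same EvaluationLink as in the example, which NLP generation then verbalizes.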

I also plan to limit the scope of NLP processing and word sense disambiguation (WSD), so I'll just use link-grammar and OpenCog's RelEx; parseable sentences are therefore also scoped by these libraries' capabilities.

I also need to state my scope better so as not to inflate the expectations of my thesis advisor, Dr.techn. Ary Setijadi Prihatmanto. :)

Wednesday, July 2, 2014

Simulating Sympathy in a Robot with MOSES Machine Learning

Sentences in question-answer format come with clear expectations about the result of interacting with the robot. But what about conversational storytelling?

For example, suppose the robot sees a friend post the status:
"Aku baru putus..." ("I just broke up...")
The robot could then respond with "aww... aku ikut sedih" ("aww... I'm sad for you") or something similar.


The situation is not only private conversation (a chatbot), but also communication in groups or on social media walls, which may or may not trigger a response.

The required steps:
  1. Parse the received input/news (from both private and public channels) into a semantic structure
  2. Assert that semantic structure into long-term/short-term memory along with annotation metadata (extractedFrom, timestamp)
  3. Perform predictive inference (presumption?) on those semantics (if Titi just broke up, then Titi is sad)
  4. Find the most reasonable action plan based on currently known information (short-term and long-term memory)
  5. Execute the action plan (possibly several actions, sequentially)
  6. For actions that require natural language generation, transform the semantic action into a sentence in the target language.
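Steps 1-3 of the pipeline can be sketched as toy Python stubs (everything here is illustrative: the hard-coded subject "Titi", the source tag, and the tuple shapes are my own stand-ins, not a real parser or inference engine):

```python
memory = []

def parse(text):
    # 1. Parse input into a semantic structure (toy pattern match)
    if "putus" in text:  # Indonesian for "broke up"
        return ("event", "breakup", "Titi")
    return ("event", "unknown", None)

def assert_fact(fact, source):
    # 2. Assert into short-term memory with annotation metadata
    memory.append({"fact": fact, "extractedFrom": source, "responded": False})

def infer(fact):
    # 3. Predictive inference: a breakup implies sadness
    if fact[1] == "breakup":
        return ("state", "sad", fact[2])
    return None

# Running steps 1-3 on the example status
fact = parse("Aku baru putus...")
assert_fact(fact, "facebook-wall")
prediction = infer(fact)
if prediction is not None:
    assert_fact(prediction, "inference")
```

After this, memory holds both the observed fact and the inferred mood, which is exactly the input situation that step 4 (action plan search) works from.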

Assuming the modules for parsing syntactic structure into semantics and back are already working well (in reality, not yet), the most important stage is stage 4: finding the action plan.

MOSES, as a procedural machine learning framework, might be usable for this. Given the following inputs in short-term memory:
  1. Titi just broke up
  2. Titi is sad (inferred prediction)
  3. Fact #1 has not been responded to

We can then run MOSES to generate a "program", or solution, with a deme scoped by the parameters above. The "program" generated here is a semantic sentence. A program's score or fitness is based on how well its statement matches facts #1 and #2.

Suppose the generated candidates and their fitness are as follows:
  1. "Aku ikut senang" ("I'm happy for you") = -0.5
  2. "Aku ikut sedih" ("I'm sad for you") = +0.5

The robot would then choose response #2.
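The selection step can be sketched with a toy fitness scorer (this is not MOSES; it's a hand-written stand-in that matches response sentiment to the inferred mood, with invented tuple shapes):

```python
def fitness(response, facts):
    """Toy fitness: +0.5 if the response's sentiment matches an inferred mood, else -0.5."""
    moods = {f[1] for f in facts if f[0] == "state"}  # e.g. {"sad"}
    sentiment = {"Aku ikut senang": "happy", "Aku ikut sedih": "sad"}[response]
    return 0.5 if sentiment in moods else -0.5

# Short-term memory facts #1 and #2 from above
facts = [("event", "breakup", "Titi"), ("state", "sad", "Titi")]
candidates = ["Aku ikut senang", "Aku ikut sedih"]

# Pick the highest-fitness response
best = max(candidates, key=lambda r: fitness(r, facts))
```

MOSES would instead evolve both the candidate programs and (eventually) the scorer itself, but the selection-by-fitness step looks like this `max` call.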

The homework, then:
  1. Build a generator/permuter of responses based on the deme
  2. Define the fitness scoring algorithm given a response and the state of memory

The fitness scoring algorithm itself can use machine learning, so the robot learns to recognize on its own which responses are good and which are inappropriate for a given situation.

Is MOSES suitable for this? Or would it be better to use PLN (Probabilistic Logic Networks), another approach, or even a simple hack? We'll see. :-)