Piro2 - Blog

Apr 23rd, 2017 - NLU work continues

I have been on an Easter vacation this past week, so I have managed to work on Piro2 for a couple of hours almost every day. I have been continuing my work on the natural language understanding features. For the past couple of days I have been working on the sentence handlers for the verb "look", so that I can ask questions and give commands with sentences containing that verb. Piro2 can move its head to look in a given direction, and it can also let me know where it is currently looking, or where it has recently looked.

Yesterday I switched to using a JSGF grammar with the Pocket Sphinx speech recognition, as that considerably improves the recognition accuracy. When using a grammar, the recognizer only needs to decide between the given sentences, instead of trying to recognize every single word in the vocabulary. This worked so well that today I decided to try to capture a video of me speaking simple sentences with the word "look" in them, with Piro2 then parsing the sentences and either replying to questions or turning its head as required.
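
To illustrate the idea, here is a minimal sketch of how a JSGF grammar can be loaded into Pocket Sphinx through its C API. The grammar rules, search name and model paths below are only placeholders for illustration, not the actual Piro2 code:

#include <pocketsphinx.h>

/* A tiny JSGF grammar restricting recognition to a few "look" sentences.
 * The rules and sentences here are made up for illustration. */
static const char *jsgf =
    "#JSGF V1.0;\n"
    "grammar look;\n"
    "public <command> = (look | turn) (left | right | up | down) |\n"
    "                   where are you looking;\n";

int main(void)
{
    /* Model paths are placeholders; use the acoustic model and
     * dictionary installed on your own system. */
    cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
        "-hmm",  "/usr/local/share/pocketsphinx/model/en-us/en-us",
        "-dict", "/usr/local/share/pocketsphinx/model/en-us/cmudict-en-us.dict",
        NULL);
    ps_decoder_t *ps = ps_init(config);
    if (ps == NULL)
        return 1;

    /* Register the JSGF grammar as a search and make it active, so the
     * decoder only considers sentences allowed by the grammar. */
    ps_set_jsgf_string(ps, "look-grammar", jsgf);
    ps_set_search(ps, "look-grammar");

    /* ... feed audio with ps_start_utt() / ps_process_raw() / ps_end_utt(),
     * then read the best hypothesis with ps_get_hyp() ... */

    ps_free(ps);
    cmd_ln_free_r(config);
    return 0;
}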

The image quality of the video is not all that great, as I did not have enough room on the small table to put my cell phone close to my robot, so I took the video from a bit further away and used some zoom. I hope to figure out a better quality solution for future videos. Anyways, the video is otherwise unedited, except that I added some subtitles in case you have trouble understanding my or my robot's English. The speech recognition still has trouble deciding whether I said "left" or "right", but other than that issue, the robot understood me surprisingly well.

I had also added a small head nod and head shake for the yes and no answers that the robot gives, to make the robot look a bit more interactive. I still need to make the head turning more human-like by smoothing the servo curves, but I am currently more interested in continuing my work on the speech understanding.

Apr 16th, 2017 - Natural Language Understanding

After the last blog post I again took a bit of a break from working on Piro2. I had somewhat lost my motivation for working on the project, as I had gotten quite frustrated with the poor speech recognition accuracy. My goal is to be able to converse with my robot, and that is pretty useless if the robot does not understand me at all. I had been testing various ways to improve the accuracy, but the only thing that considerably improved the accuracy was to switch to a list of predetermined commands or sentences, and that limits the whole idea quite seriously.

However, after reading Perilous Waif by E. William Brown, I again got motivated to continue working on my robot. That book has quite fun and interesting interactions between androids and humans (along with other interesting ideas), so it is recommended reading for all science fiction fans. I decided to forget the speech recognition problems for now, instead focusing on what happens after I have managed to capture a sentence of speech. I created a very simple text chat interface for my robot, so that I can type a sentence, and the robot code can then try to parse and understand that sentence.

Link-grammar integration

By the end of last year I had managed to create a simple link-grammar dictionary, but I had not yet managed to integrate link-grammar into my Piro2 project. At the beginning of March I worked on this integration, and managed to get the link-grammar code to compile with my Piro2 sources. In the end I only needed to make changes to two link-grammar source modules.
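
For reference, the basic parsing flow using the link-grammar C API (5.x-style calls) looks roughly like the sketch below. This is only an illustration of the calls involved, not the actual Piro2 integration code, and it uses the stock English dictionary instead of my own reduced one:

#include <stdio.h>
#include <stdbool.h>
#include <link-grammar/link-includes.h>

int main(void)
{
    /* "en" loads the standard English dictionary; Piro2 uses my own
     * reduced dictionary instead. */
    Dictionary dict = dictionary_create_lang("en");
    Parse_Options opts = parse_options_create();

    Sentence sent = sentence_create("I see", dict);
    if (sentence_parse(sent, opts) > 0) {
        Linkage linkage = linkage_create(0, sent, opts);

        /* Print the ASCII diagram, like the one shown in the debug
         * output further below. */
        char *diagram = linkage_print_diagram(linkage, true, 80);
        printf("%s\n", diagram);
        linkage_free_diagram(diagram);

        /* The link labels (RW, Wd, S, ...) can be walked like this. */
        for (int i = 0; i < linkage_get_num_links(linkage); i++)
            printf("%s\n", linkage_get_link_label(linkage, i));

        linkage_delete(linkage);
    }
    sentence_delete(sent);
    parse_options_delete(opts);
    dictionary_delete(dict);
    return 0;
}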

From link-grammar to what?

Next, I spent several days trying to figure out how to utilize the link-grammar parse graph for actual natural language understanding (NLU). After a week or so without making any real progress, I decided to just start coding something, the idea being to get a better understanding of the link-grammar graph structures. I began creating specific routines for specific link-grammar link orders; for example, a simple sentence "I see" generates the links RW, Wd and S, so I created a routine called RW_Wd_S() to handle such simple subject + verb sentences (see the sketch after the table below). I soon realized that it might be a good idea to handle all tense forms of the verb of such short sentences in the same routine, so I used the British Council pages as a reference for all the tense forms:

Form                  Link-grammar    Tense
I see                 RW_Wd_S         present simple
I am seeing           RW_Wd_S_P       present continuous
I have seen           RW_Wd_S_PP      present perfect
I have been seeing    RW_Wd_S_PP_P    present perfect continuous
I saw                 RW_Wd_S         past simple
I was seeing          RW_Wd_S_P       past continuous
I had seen            RW_Wd_S_PP      past perfect
I had been seeing     RW_Wd_S_PP_P    past perfect continuous
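
In code, the idea amounted to concatenating the link labels into a routine name and then calling the matching routine. A minimal sketch of the key-building step could look like the following; the helper name is made up and this is not the actual Piro2 code:

#include <string.h>
#include <link-grammar/link-includes.h>

/* Build a key like "RW_Wd_S" from the linkage; the key would then be
 * used to select a routine with the same name. For "I see" this gives
 * "RW_Wd_S", so the code would call RW_Wd_S(). */
static void build_link_key(const Linkage linkage, char *key, size_t size)
{
    int num_links = linkage_get_num_links(linkage);
    key[0] = '\0';
    for (int i = 0; i < num_links; i++) {
        if (i > 0)
            strncat(key, "_", size - strlen(key) - 1);
        strncat(key, linkage_get_link_label(linkage, i),
                size - strlen(key) - 1);
    }
}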

I also decided that the negative forms (like "I did not see", "I haven't been seeing") might be good to handle in the same routines as well. I spent a week or so coding such routines, but then ran into a problem when the sentences became more complex. Since my routines did not check the link targets, only the link labels, the exact same link labels in the same order could be generated from two quite different sentences. I decided that this was not the correct way to handle the link-grammar results, and went back to the drawing board, so to speak.

In the beginning of April I decided to start working on a system where I just build a list of all the nouns and verbs that link-grammar encounters, along with a list of adjectives and prepositions. Each verb has a link to the subject noun and up to two object nouns, together with the tense and negation of the verb, and a possible modal or auxiliary verb connected to it. Each noun has a link to its determiner(s). Each adjective has a target link (to the noun or verb that it modifies), and a list of adverbs that modify this adjective. The preposition list is a collection of pretty much all the other parts of the sentence, with links to whatever word the preposition modifies.
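
In C terms, the idea is roughly the following. This is only a sketch of the data layout; the type and field names are made up for illustration rather than taken from the actual Piro2 sources:

/* Simplified sketch of the internal sentence structures. */
#define MAX_WORDS 32

typedef struct {
    int word_idx;            /* index of the noun in the sentence       */
    int determiners[4];      /* indices of "a", "the", "my", ...        */
} noun_t;

typedef struct {
    int word_idx;            /* index of the verb in the sentence       */
    int subject;             /* index into the noun table               */
    int objects[2];          /* up to two object nouns                  */
    int tense;               /* present simple, past perfect, ...       */
    int negated;             /* "did not see", "haven't been seeing"    */
    int modal;               /* modal or auxiliary verb, if any         */
} verb_t;

typedef struct {
    int word_idx;            /* index of the adjective                  */
    int target;              /* noun or verb it modifies                */
    int adverbs[4];          /* adverbs modifying this adjective        */
} adjective_t;

typedef struct {
    int word_idx;            /* index of the preposition                */
    int target;              /* whatever word the preposition modifies  */
} preposition_t;

typedef struct {
    noun_t        nouns[MAX_WORDS];
    verb_t        verbs[MAX_WORDS];
    adjective_t   adjectives[MAX_WORDS];
    preposition_t prepositions[MAX_WORDS];
    int noun_count, verb_count, adjective_count, preposition_count;
} parsed_sentence_t;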

After a few days working on this new system, I was able to build the internal noun/verb/etc structures for most of the sentences I found from the random list of sentences at http://www.englishinuse.net/. Sentences with commas are still a problem, but on the other hand, there are no commas in the spoken (speech recognized) language, so I will need to handle such sentences in some other way in any case.

Database of concepts

For the past couple of days I have been working on the next step of the natural language understanding. I began to use an SQLite database in Piro2, and created a few tables that would contain entities, properties and instances of entities that Piro2 would have knowledge of. Getting SQLite to work on a Raspberry Pi is pretty simple; I just needed to follow these steps:

sudo apt-get install sqlite3
sudo apt-get install libsqlite3-dev
sqlite3 piro2.db

I added a couple of entities, "human" and "robot", to the database, both with a single instance. Both of those entities have a "name" property, so that Piro2 now has a database that contains a "human" instance with the name "Patrick", and a "robot" instance with the name "Piro". I also added the basis of self-awareness by making my code understand that the "robot" instance named "Piro" means itself.
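
The resulting database layout is roughly as sketched below, using the SQLite C API. The table and column names here are illustrative; the actual Piro2 schema may differ:

#include <stdio.h>
#include <sqlite3.h>

/* Sketch of the entity/property/instance tables. */
static const char *schema =
    "CREATE TABLE IF NOT EXISTS entity ("
    "  id INTEGER PRIMARY KEY, name TEXT);"            /* 'human', 'robot'  */
    "CREATE TABLE IF NOT EXISTS property ("
    "  id INTEGER PRIMARY KEY, entity_id INTEGER,"
    "  name TEXT);"                                    /* e.g. 'name'       */
    "CREATE TABLE IF NOT EXISTS instance ("
    "  id INTEGER PRIMARY KEY, entity_id INTEGER);"
    "CREATE TABLE IF NOT EXISTS instance_value ("
    "  instance_id INTEGER, property_id INTEGER,"
    "  value TEXT);";                                   /* 'Patrick', 'Piro' */

int main(void)
{
    sqlite3 *db;
    char *err = NULL;

    if (sqlite3_open("piro2.db", &db) != SQLITE_OK)
        return 1;
    if (sqlite3_exec(db, schema, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "SQL error: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
    return 0;
}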

My simple chat interface can already handle questions like "Who are you?" ("My name is Piro.") and "What are you?" ("I am a robot."). For fun, I also used a list from http://hello.faqcow.com/howareyou100.html to give a random reply for the "How are you?" question. Here below is an example of the current output (including various debug prints) for a "how are you" question:

how are you
s = 'LEFT-WALL'
s = 'how'
s = 'are'
s = 'you'
s = 'RIGHT-WALL'

    +--Wq--+AFh+SIp+
    |      |   |   |
LEFT-WALL how are you


RW
Wq
Found Wq: Where/When/Why/Object/Adjectival question.
AFh
SIpx
Category=10
NOUN table:
idx=3 (you), role=S, pp_tag=27
 - Adjectives:
VERB table:
idx=2 (are), modal=0, tense=1, neg=1
 - Subject: 3 (you)
 - Adjectives: 1 (how)
PREP table:
Handling 'how' question...
Handling 'how be' question...
I am feeling happier than ever!

I think I will code a specific handler for each verb that Piro2 should understand, so that it can then reply to questions like "Can you hear me?" or "What do you see?". It can already handle various questions using the verb "be" (in addition to those mentioned above), like for example "Is your name Piro?" ("Yes, my name is Piro.") or "Are you a human?" ("No, I am a robot.").
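
Such per-verb handlers could be organized as a simple dispatch table, along these lines. This is a rough sketch with made-up handler names, a placeholder struct and placeholder replies, not the actual Piro2 code:

#include <stdio.h>
#include <string.h>

typedef struct { int is_question; int negated; } parsed_verb_t;  /* placeholder */

typedef void (*verb_handler)(const parsed_verb_t *v);

/* The replies below are placeholders for illustration only. */
static void handle_be(const parsed_verb_t *v)   { printf("I am a robot.\n");        }
static void handle_hear(const parsed_verb_t *v) { printf("Yes, I can hear you.\n"); }
static void handle_see(const parsed_verb_t *v)  { printf("I see a bright room.\n"); }

static const struct { const char *verb; verb_handler fn; } verb_handlers[] = {
    { "be", handle_be }, { "hear", handle_hear }, { "see", handle_see },
};

/* Look up the handler for the verb found in the parsed sentence. */
static void handle_verb(const char *verb, const parsed_verb_t *v)
{
    for (size_t i = 0; i < sizeof(verb_handlers) / sizeof(verb_handlers[0]); i++) {
        if (strcmp(verb_handlers[i].verb, verb) == 0) {
            verb_handlers[i].fn(v);
            return;
        }
    }
    printf("I do not understand that verb yet.\n");
}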

The next steps are to enhance both the code and the database with new properties and handlers for new verbs. It is possible that I still need to change my approach to natural language understanding, but for now I will continue with this approach. At some point I will again activate the speech recognition, and check what happens when I try to speak these questions and let Piro2 then speak the replies. If I get that working, it might even be worth taking a video of.

Jan 29th, 2017 - Piro2 work continues after a short hiatus

For the past couple of weeks I have been distracted by some other projects and issues, and thus have not been working on Piro2 for a while now. One of the issues was that I had a couple of hard disk failures in my NAS server machine. First, one of my oldest 80GB hard disks, which I had used as part of a RAID0 pair, broke down, which meant I lost all data I had on both of those disks. Luckily I had used this array only for storing some temporary files, so I did not lose anything important. However, just a week after that, I lost another disk, this time a terabyte disk in a RAID1 array. Strangely, the disk that broke down was the newest of all my disks, but even it was almost three years old. Of course it had only a two-year warranty. I had to purchase a new terabyte disk and rebuild the RAID1 array. Luckily all of that went fine and I did not lose any data while replacing the disk.

Anyways, now that I am back working on Piro2, the first step was to see what happens if I try to increase the face detection image size from 320x240 up to 1280x720. This turned out to be a rather simple change; I just needed to adjust some hard-coded values in my ASM routines so that they handled this size properly. The Haar cascade face detection builds a set of differently scaled images that it uses for the detection, and increasing the image size increased the number of these scaled images to 36. With the original 320x240 image size the algorithm used around 20 scaled images. The increased size and number of images dropped the face detection rate to around one frame per second.

I am thinking of using motion detection to try to limit the areas of the image where face detection would be useful, so I next began working on a motion detection algorithm. In principle this is pretty simple; I just need to compare two images taken at different times, and detect the pixels that have changed. This was also a nice routine for trying out NEON ASM code, and I even learned about a couple of new NEON ASM operations that I had not used before. I decided that the motion detection image does not need to have the full 1280x720 resolution, so I went down to a 160x90 resolution for the motion detection image.

The way my motion detection algorithm currently works is by always storing a smoothed 640x360 image together with the original 1280x720 camera capture image, at 30 frames per second. Then, at around one frame per second (as it is currently tied to the face detection routine), I smooth the latest 640x360 image further down to 160x90 and compare it with the similarly smoothed-down 640x360 image from the previous motion detection step, which currently means about one second ago. There is still a lot of room for improvement; for example, it is unnecessary to smooth the previous image again when comparing the images. I did that just because it made the algorithm simpler, and because I can easily test different ways of smoothing the image. This is still very much a work in progress.
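
In plain C, the computation done by the NEON routine shown below corresponds roughly to the following scalar sketch: average each 4x4 block of the two 640x360 images down to one pixel (giving 160x90), and store the absolute difference of the block averages. The rounding details differ slightly from the NEON version, and this is only a reference sketch, not the code Piro2 actually runs:

#include <stdint.h>
#include <stdlib.h>

void motion_detect_c(uint8_t *out,            /* 160x90 output map      */
                     const uint8_t *new_img,  /* 640x360 new image      */
                     const uint8_t *prev_img) /* 640x360 previous image */
{
    for (int y = 0; y < 90; y++) {
        for (int x = 0; x < 160; x++) {
            int sum_new = 0, sum_prev = 0;
            for (int dy = 0; dy < 4; dy++) {
                for (int dx = 0; dx < 4; dx++) {
                    int idx = (y * 4 + dy) * 640 + (x * 4 + dx);
                    sum_new  += new_img[idx];
                    sum_prev += prev_img[idx];
                }
            }
            int avg_new  = (sum_new  + 8) / 16;   /* rounded block average */
            int avg_prev = (sum_prev + 8) / 16;
            out[y * 160 + x] = (uint8_t)abs(avg_new - avg_prev);
        }
    }
}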

Here below is the full motion detection image calculation ASM routine. The interesting NEON ASM operations that I use are vrhadd.u8 (rounding halving add), vpaddl.u8 and vpadd.i16 (pairwise adds), vqrshrun.s16 (saturating rounding shift right and narrow) and vabd.u8 (absolute difference):

//------------------------ motion_detect ------------------------
//
// Input:
//	r0 = pointer to output motion detection map (160x90)
// 	r1 = pointer to new imgbuffer (640x360)
// 	r2 = pointer to previous imgbuffer (640x360)
//
motion_detect:
	pld		[r1]
	pld		[r2]
	push   		{r4-r11, lr}
	vpush		{d8-d23}
	vpush		{d24-d31}
	add		r3, r1, #640
	add		r4, r2, #640
	add		r5, r1, #2*640
	add		r6, r2, #2*640
	add		r7, r1, #3*640
	add		r8, r2, #3*640
	mov		r9, #360				// r9 = Y loop counter
1:	mov		r10, #640				// r10 = X loop counter
2:	vldmia		r1!, {d0-d3}				// Load 32 pixels from new imgbuffer first row
	vldmia		r2!, {d4-d7}				// Load 32 pixels from previous imgbuffer first row
	vldmia		r3!, {d8-d11}				// Load 32 pixels from new imgbuffer second row
	vldmia		r4!, {d12-d15}				// Load 32 pixels from previous imgbuffer second row
	vldmia		r5!, {d16-d19}				// Load 32 pixels from new imgbuffer third row
	vldmia		r6!, {d20-d23}				// Load 32 pixels from previous imgbuffer third row
	vldmia		r7!, {d24-d27}				// Load 32 pixels from new imgbuffer fourth row
	vldmia		r8!, {d28-d31}				// Load 32 pixels from previous imgbuffer fourth row
	pld		[r1]
	pld		[r2]
	pld		[r3]
	pld		[r4]
	pld		[r5]
	pld		[r6]
	pld		[r7]
	pld		[r8]
	//-----
	// Average the four rows of pixels together, new imgbuffer
	//-----
	vrhadd.u8	q0, q4
	vrhadd.u8	q1, q5
	vrhadd.u8	q8, q12
	vrhadd.u8	q9, q13
	vrhadd.u8	q0, q8
	vrhadd.u8	q1, q9
	//-----
	// Average the four rows of pixels together, previous imgbuffer
	//-----
	vrhadd.u8	q2, q6
	vrhadd.u8	q3, q7
	vrhadd.u8	q10, q14
	vrhadd.u8	q11, q15
	vrhadd.u8	q2, q10
	vrhadd.u8	q3, q11
	//-----
	// Average the four adjacent pixels together, new imgbuffer
	//-----
	vpaddl.u8	q0, q0
	vpaddl.u8	q1, q1
	vpadd.i16	d0, d0, d1
	vpadd.i16	d1, d2, d3
	vqrshrun.s16	d0, q0, #2
	//-----
	// Average the four adjacent pixels together, previous imgbuffer
	//-----
	vpaddl.u8	q2, q2
	vpaddl.u8	q3, q3
	vpadd.i16	d4, d4, d5
	vpadd.i16	d5, d6, d7
	vqrshrun.s16	d2, q2, #2
	//-----
	// Calculate and store the difference between new imgbuffer and previous imgbuffer
	//-----
	subs		r10, #32
	vabd.u8		d0, d2
	vstmia		r0!, {s0,s1}				// Save the resulting 8 pixels to the target.
	bgt		2b					// Back to X loop if X loop counter still > 0
	//-----
	// One row done, prepare for the next row
	//-----
	subs		r9, #4
	add		r1, #3*640
	add		r2, #3*640
	add		r3, #3*640
	add		r4, #3*640
	add		r5, #3*640
	add		r6, #3*640
	add		r7, #3*640
	add		r8, #3*640
	bgt		1b					// Back to Y loop if Y loop counter still > 0
	vpop		{d24-d31}
	vpop		{d8-d23}
	pop		{r4-r11, pc}				// Return to caller

I also did some minor hardware work: I attached suitable connectors to the head and neck servos, so that I can use a separate power source for them (and also for the speaker). The next step would be to implement an algorithm that moves the head of my robot based on the motion detection image. The robot should turn to look at whatever movement it detects. This made me think about implementing a sort of "instinct" behavioral layer, so that the robot follows these instincts whenever the higher AI routines do not override the behavior. One such instinct would be to look towards any movement.

Previous blog entries