I found that 4 threads use less memory than 2 (on average). I will check 8 threads next.
I also dug into his source code. He really did add async I/O in the third phase.