Me too! I don't program in Java, but I've never read that the compiler or JVM can do anything like that automatically (making the user's code run in parallel even if the user didn't specify anything).
I suspect that what he observes is the garbage collector running on several cores at once (AFAIK it is designed to work that way now), and that garbage collection was the cause of the slowness in Go too. I presume he managed to do a heck of a lot of allocations, which caused both the slowness in Go and the full-throttle GC run in Java. But this is just a guess. If anybody knows more exact details, please write!
Go can't really be single-threaded unless you deliberately make it so (or your program is really short).
A lot of the standard operations are in fact executed under the hood in goroutines on a pool of threads. I/O is, for example. So you find yourself gaining all the advantages of asynchrony without bearing the cost in your code... exactly like this example in Java.
Why doesn't my multi-goroutine program use multiple CPUs?
You must set the GOMAXPROCS shell environment variable or use the similarly-named function of the runtime package to allow the run-time support to utilize more than one OS thread.
Programs that perform parallel computation should benefit from an increase in GOMAXPROCS. However, be aware that concurrency is not parallelism.