I did something similar recently for a game I'm working on[1]. I didn't know it was called a Markov chain, but one thing I found is that if you take the two previous letters to generate the next one, the results are a little less random and seem a little more natural.
The more letters you take to generate the next one, the closer to the original source data you get, but with a big enough corpus of source data, you can still make random names using three or four letters.
It was probably still a Markov chain. But when Markov chains are used in this way, they are usually referred to as n-gram models (https://en.wikipedia.org/wiki/N-gram) where "n" is the number of previous items you investigate (in your case, a 2-gram or bigram model).
Ideally, you would want to go back infinitely many times. My understanding of why Markov models are great is because they reduce the computation required, by not looking back too far, yet they still achieve good results. The more you look back, the more possible combinations (in this case, of letters) there are to consider. As the number of possible combinations of letters increases, the chance of each combination appearing in your training corpus decreases.
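The letter-level 2-gram idea described above can be sketched in a few lines of Python. The corpus here is invented for illustration; a real name generator would feed in a large list of real names.

```python
import random
from collections import defaultdict

# A minimal sketch of a letter-level 2-gram (bigram) name generator.
# The corpus below is made up; a real one would be much larger.
corpus = ["marion", "harold", "gareth", "martha", "harriet", "garon"]

# Map each pair of previous letters to the letters observed after it.
# "^" pads the start of a name and "$" marks the end.
transitions = defaultdict(list)
for name in corpus:
    padded = "^^" + name + "$"
    for i in range(len(padded) - 2):
        transitions[padded[i:i + 2]].append(padded[i + 2])

def generate(max_len=10):
    state, out = "^^", []
    while len(out) < max_len:
        nxt = random.choice(transitions[state])
        if nxt == "$":          # the chain chose to end the name
            break
        out.append(nxt)
        state = state[1] + nxt  # slide the two-letter window forward
    return "".join(out)

print(generate())
```

Using the two previous letters instead of one is exactly what makes the output "a little less random": each choice is conditioned on more context, so impossible letter pairs never appear.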
N-gram models are often used at the whole-word level. That is, instead of the "next letter", they are interested in the "next word". This leads to interesting ways to perform spell checking based on the context of the surrounding words. For example, take the following sentences:
"I did that to"
"I did that too"
The "to" is a spelling mistake, even though it is an English word with correct spelling. Imagine how many times the phrase "I did that to" occurs at the end of a sentence in a large enough corpus, compared to "I did that too".
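The comparison above amounts to looking up the two candidate phrases in an n-gram frequency table and preferring the more common one. A toy sketch (the counts here are invented for illustration; a real system would count 4-grams over billions of words):

```python
from collections import Counter

# Invented 4-gram counts standing in for a real corpus table.
ngram_counts = Counter({
    ("i", "did", "that", "too"): 980,
    ("i", "did", "that", "to"): 12,
})

def score(phrase):
    """How often the phrase occurs as a 4-gram in the (toy) corpus."""
    return ngram_counts[tuple(phrase.lower().split())]

# The more frequent variant is the likelier intended spelling.
print(score("I did that too") > score("I did that to"))  # True
```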
My understanding is that Google has the largest corpus and largest "N" (they have a 5-gram model). The cool thing is that they have released it under a CC license (http://books.google.com/ngrams/datasets).
I once made a word-based Markov chain and fed it a corpus of my instant messages. I did that to amuse myself. I did that too crudely, though; I didn't give it any notion of beginning or ending a sentence (although I kept capital letters and periods as part of each token, simply because that was easy to do). I don't think there were any other corpuses that I did that to.
I did something like this a few weeks ago using quotes from a loudly spoken team member.
Previously my workmates and I had started jotting down his sayings, and before we knew it we had 2,000 or so entries in a database of the stuff he had said. I ran it through a Markov chain to see what sort of nonsense it would produce. My favorite thing that has come out of it so far is the following:
"While I am changing my underwear people should check my email. Its an old Greek saying mate."
I had a similar idea for generating test data for relational databases.
It seems that for test data, the boundary conditions and exceptional cases are relatively easy; it's the common stuff that is harder to fake.
My idea is to create test data for numeric columns by estimating statistical parameters and using those in conjunction with a random number generator to make 100-500 rows of fake data.
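For a numeric column, that could be as simple as fitting a mean and standard deviation and sampling from a normal distribution. A sketch, with invented salary values standing in for the real column:

```python
import random
import statistics

# Invented "real" column values for illustration only.
real_salaries = [52000, 61000, 48000, 75000, 58000, 66000, 71000, 54000]

# Estimate the statistical parameters from the real data.
mu = statistics.mean(real_salaries)
sigma = statistics.stdev(real_salaries)

# Generate fake rows from those parameters (200 is an arbitrary
# choice within the 100-500 range suggested above).
fake_salaries = [round(random.gauss(mu, sigma)) for _ in range(200)]
```

A real column might need a different distribution (salaries, for instance, are often right-skewed), but the principle is the same: ship the fitted parameters, not the data.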
But for the first and last names (and other textual columns) I've been thinking about modeling the data using Markov processes to be able to come up with fake names and addresses that are somewhat close to the real data.
I think that once you have a good statistical model you could export that and outsource testing more easily without compromising confidential information. If things like average salary were considered confidential then that could be skewed as a kind of obfuscation step.
That is: how do I release interesting and useful data while still preventing the disclosure of important information such as salary, or SSN for those of you in the US?
Cool! I'll have to remember this when we can't pick a name for our next baby. A little hit or miss, but for a geeky name it is better than little Bobby tables...
Slightly off-topic: does anyone know of tree(?) graphs of Markov chains for certain bodies of words? E.g. the probability of following characters for a certain book, showing only the more probable choices. That might look pretty cool.
Just so you know, I'm looking at this from my Android 4.1 stock browser and the entire page is blinking on and off randomly like some kind of joke. I can't scroll down because it seems as if it's constantly reloading itself.
I had a bunch of sources. None with really good results, so it didn't really matter... I should work on it more. I have a much more expansive idea involving big relationship graphs and whatnot, but actual startup is taking up too much time. Someday.
Because the vast majority of people who need to choose a name are doing so for a baby, not someone older. Obviously in this case it's not really designed to be a useful tool, but it's still based on the idea of helping parents pick a name for their new child.
Heh, agreed - it's definitely a mixed bag; that's what makes it fun. I've since cut it off at a minimum of three-character names; that said, IMHO "C" is a pretty sweet name.
Some of the names aren't so bad, though: Marin, Harker, Gacon... Once in a while it'll generate a real name by accident. Those are my favorite.
[1] http://www.war-worlds.com/blog/2012/07/generating-names