I did something similar recently for a game I'm working on[1]. I didn't know it was called a Markov chain, but one thing I found is that if you take the two previous letters to generate the next one, the results are a little less random and seem a little more natural.
The more letters you take to generate the next one, the closer to the original source data you get, but with a big enough corpus of source data, you can still make random names using three or four letters.
It was probably still a Markov chain. But when Markov chains are used in this way, they are usually referred to as n-gram models (https://en.wikipedia.org/wiki/N-gram) where "n" is the number of previous items you investigate (in your case, a 2-gram or bigram model).
Ideally, you would want to go back infinitely many times. My understanding of why Markov models are great is because they reduce the computation required, by not looking back too far, yet they still achieve good results. The more you look back, the more possible combinations (in this case, of letters) there are to consider. As the number of possible combinations of letters increases, the chance of each combination appearing in your training corpus decreases.
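The letter-level 2-gram idea described above can be sketched in a few lines of Python. The corpus here is invented for illustration; a real name generator would feed in a large list of real names.

```python
import random
from collections import defaultdict

# A minimal sketch of a letter-level 2-gram (bigram) name generator.
# The corpus below is made up; a real one would be much larger.
corpus = ["marion", "harold", "gareth", "martha", "harriet", "garon"]

# Map each pair of previous letters to the letters observed after it.
# "^" pads the start of a name and "$" marks the end.
transitions = defaultdict(list)
for name in corpus:
    padded = "^^" + name + "$"
    for i in range(len(padded) - 2):
        transitions[padded[i:i + 2]].append(padded[i + 2])

def generate(max_len=10):
    state, out = "^^", []
    while len(out) < max_len:
        nxt = random.choice(transitions[state])
        if nxt == "$":          # the chain chose to end the name
            break
        out.append(nxt)
        state = state[1] + nxt  # slide the two-letter window forward
    return "".join(out)

print(generate())
```

Using the two previous letters instead of one is exactly what makes the output "a little less random": each choice is conditioned on more context, so impossible letter pairs never appear.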
N-gram models are often used at the whole-word level. That is, instead of the "next letter", they are interested in the "next word". This leads to interesting ways to perform spell checking based on the context of the surrounding words. For example, take the following sentences:
"I did that to"
"I did that too"
The "to" is a spelling mistake, even though it is an English word with correct spelling. Imagine how many times the phrase "I did that to" occurs at the end of a sentence in a large enough corpus, compared to "I did that too".
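The comparison above amounts to looking up the two candidate phrases in an n-gram frequency table and preferring the more common one. A toy sketch (the counts here are invented for illustration; a real system would count 4-grams over billions of words):

```python
from collections import Counter

# Invented 4-gram counts standing in for a real corpus table.
ngram_counts = Counter({
    ("i", "did", "that", "too"): 980,
    ("i", "did", "that", "to"): 12,
})

def score(phrase):
    """How often the phrase occurs as a 4-gram in the (toy) corpus."""
    return ngram_counts[tuple(phrase.lower().split())]

# The more frequent variant is the likelier intended spelling.
print(score("I did that too") > score("I did that to"))  # True
```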
My understanding is that Google has the largest corpus and largest "N" (they have a 5-gram model). The cool thing is that they have released it under a CC license (http://books.google.com/ngrams/datasets).
I once made a word-based Markov chain and fed it a corpus of my instant messages. I did that to amuse myself. I did that too crudely, though; I didn't give it any notion of beginning or ending a sentence (although I kept capital letters and periods as part of each token, simply because that was easy to do). I don't think there were any other corpuses that I did that to.
I did something like this a few weeks ago using quotes from a loudly spoken team member.
Previously my workmates and I had started jotting down his sayings, and before we knew it we had 2,000 or so entries in a database of the stuff he had said. I ran it through a Markov chain to see what sort of nonsense it would produce. My favorite thing that has come out of it so far is the following:
"While I am changing my underwear people should check my email. Its an old Greek saying mate."
I had a similar idea for generating test data for relational databases.
It seems that for test data, the boundary conditions and exceptional cases are relatively easy; it's the common stuff that is harder to fake.
My idea is to create test data for numeric columns by estimating statistical parameters and using those in conjunction with a random number generator to make 100-500 rows of fake data.
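For a numeric column, that could be as simple as fitting a mean and standard deviation and sampling from a normal distribution. A sketch, with invented salary values standing in for the real column:

```python
import random
import statistics

# Invented "real" column values for illustration only.
real_salaries = [52000, 61000, 48000, 75000, 58000, 66000, 71000, 54000]

# Estimate the statistical parameters from the real data.
mu = statistics.mean(real_salaries)
sigma = statistics.stdev(real_salaries)

# Generate fake rows from those parameters (200 is an arbitrary
# choice within the 100-500 range suggested above).
fake_salaries = [round(random.gauss(mu, sigma)) for _ in range(200)]
```

A real column might need a different distribution (salaries, for instance, are often right-skewed), but the principle is the same: ship the fitted parameters, not the data.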
But for the first and last names (and other textual columns) I've been thinking about modeling the data using Markov processes to be able to come up with fake names and addresses that are somewhat close to the real data.
I think that once you have a good statistical model you could export that and outsource testing more easily without compromising confidential information. If things like average salary were considered confidential then that could be skewed as a kind of obfuscation step.
That is: how do I release interesting and useful data while still preventing the disclosure of important information such as salary, or SSN for those of you in the US?
Cool! I'll have to remember this when we can't pick a name for our next baby. A little hit or miss, but for a geeky name it is better than little Bobby tables...
Slightly off-topic: does anyone know of tree(?) graphs of Markov chains for certain bodies of words? E.g. the probability of following characters for a certain book, showing only the more probable choices. That might look pretty cool.
Just so you know, I'm looking at this from my Android 4.1 stock browser and the entire page is blinking on and off randomly like some kind of joke. I can't scroll down because it seems as if it's constantly reloading itself.
I had a bunch of sources. None with really good results, so it didn't really matter... I should work on it more. I have a much more expansive idea involving big relationship graphs and whatnot, but actual startup is taking up too much time. Someday.
Because the vast majority of people who need to choose a name are doing so for a baby, not someone older. Obviously in this case it's not really designed to be a useful tool, but it's still based on the idea of helping parents pick a name for their new child.
Heh, agreed - it's definitely a mixed bag; that's what makes it fun. I've since cut it off at a minimum of three-character names; that said, IMHO "C" is a pretty sweet name.
Some of the names aren't so bad, though: Marin, Harker, Gacon... Once in a while it'll generate a real name by accident. Those are my favorite.
[1] http://www.war-worlds.com/blog/2012/07/generating-names