Show HN: Markov Baby Name Generator (alexeymk.com)
37 points by AlexeyMK on July 16, 2012 | 27 comments


I did something similar recently for a game I'm working on[1]. I didn't know it was called a Markov chain, but one thing I found is that if you take the two previous letters to generate the next one, the results are a little less random and seem a little more natural.

The more letters you take to generate the next one, the closer to the original source data you get, but with a big enough corpus of source data, you can still make random names using three or four letters.

[1] http://www.war-worlds.com/blog/2012/07/generating-names
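An order-2 letter-level chain like the one described can be sketched in a few lines of Python. This is just an illustration, not the linked implementation; the training names and the `^`/`$` padding symbols are made up here:

```python
import random
from collections import defaultdict

def build_model(names, order=2):
    """Map each `order`-letter context to the letters observed after it.
    '^' pads the start of a name; '$' marks its end."""
    model = defaultdict(list)
    for name in names:
        padded = "^" * order + name.lower() + "$"
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def generate(model, order=2, max_len=12):
    """Walk the chain from the start padding until the end marker."""
    context, out = "^" * order, []
    while len(out) < max_len:
        nxt = random.choice(model[context])
        if nxt == "$":
            break
        out.append(nxt)
        context = context[1:] + nxt
    return "".join(out).capitalize()

model = build_model(["alice", "alina", "amelia", "elena", "helena"])
print(generate(model))
```

Because every context the walk can reach was seen during training (the `$` sentinel guarantees each name contributes a full set of transitions), the lookup never misses; raising `order` makes the output hew closer to the training names, exactly as described above.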


It was probably still a Markov chain. But when Markov chains are used in this way, they are usually referred to as n-gram models (https://en.wikipedia.org/wiki/N-gram), where "n" is the size of the window: the items you condition on plus the one you predict. Conditioning on the two previous letters makes yours a trigram (3-gram) model.

Ideally, you would want to look back infinitely far. My understanding of why Markov models are great is that they reduce the computation required by not looking back too far, yet they still achieve good results. The further you look back, the more possible combinations (in this case, of letters) there are to consider, and as the number of possible combinations increases, the chance of each one appearing in your training corpus decreases.

N-gram models are often used at the whole-word level. That is, instead of the "next letter", they are interested in the "next word". This leads to interesting ways to perform spell checking based on the context of the surrounding words. For example, take the following sentences:

"I did that to"

"I did that too"

The "to" is a spelling mistake, even though it is a correctly spelled English word on its own. Imagine how many times the phrase "I did that to" occurs in a large enough corpus, compared to "I did that too".
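The count comparison described above can be sketched like this. The toy corpus is invented purely for illustration; a real system would count over billions of words:

```python
from collections import Counter

# Toy corpus standing in for a large text collection.
corpus = (
    "i did that too . she did that too . i did that too quickly . "
    "he walked to the store . we went to the park ."
).split()

# Count word trigrams: how often does a word follow a two-word context?
trigrams = Counter(tuple(corpus[i:i + 3]) for i in range(len(corpus) - 2))

def pick(context, a, b):
    """Return whichever candidate word is more frequent after `context`."""
    return a if trigrams[context + (a,)] > trigrams[context + (b,)] else b

print(pick(("did", "that"), "to", "too"))  # prints "too"
```

In this tiny corpus "did that too" occurs three times and "did that to" never, so the context-sensitive check flags "to" even though a dictionary would accept it.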

My understanding is that Google has the largest corpus and largest "N" (they have a 5-gram model). The cool thing is that they have released it under a CC license (http://books.google.com/ngrams/datasets).


I once made a word-based Markov chain and fed it a corpus of my instant messages. I did that to amuse myself. I did that too crudely, though; I didn't give it any notion of beginning or ending a sentence (although I kept capital letters and periods as part of each token, simply because that was easy to do). I don't think there were any other corpuses that I did that to.
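For what it's worth, the boundary problem described above is usually handled with explicit start and end tokens, so the chain learns which words open and close sentences. A minimal word-level sketch (the training sentences here are just placeholders):

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"

def train(sentences):
    """Word bigram model with explicit sentence-boundary tokens."""
    model = defaultdict(list)
    for sentence in sentences:
        words = [START] + sentence.split() + [END]
        for prev, cur in zip(words, words[1:]):
            model[prev].append(cur)
    return model

def sample(model, max_words=20):
    """Generate a sentence by walking from START until END is drawn."""
    word, out = START, []
    while len(out) < max_words:
        word = random.choice(model[word])
        if word == END:
            break
        out.append(word)
    return " ".join(out)

model = train(["i did that to amuse myself", "i did that too crudely"])
print(sample(model))
```

Keeping capitalization and periods inside tokens works as a rough substitute, but the boundary tokens let the model start and stop sentences on purpose rather than by accident.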


Two years ago I heard (from a lecturer who knew people at Google) that Google was using a 6-gram model internally.


Yep, Programming Pearls seems to agree with you - 'order-2' Markov chains are usually more natural than 'order-1' chains: http://www.cs.bell-labs.com/cm/cs/pearls/sec153.html.


I did something like this a few weeks ago using quotes from a loudly spoken team member.

Previously my workmates and I had started jotting down his sayings, and before we knew it we had 2,000 or so entries in a database of the stuff he had said. I ran it through a Markov chain to see what sort of nonsense it would produce. My favorite thing to come out of it so far is the following:

"While I am changing my underwear people should check my email. Its an old Greek saying mate."


I had a similar idea for generating test data for relational databases. It seems that for test data the boundary conditions and exceptional cases are fairly easy; it is the common stuff that is harder to fake. My idea is to create test data for numeric columns by estimating statistical parameters and using those, in conjunction with a random number generator, to make 100-500 rows of fake data. But for first and last names (and other textual columns) I've been thinking about modeling the data with Markov processes, to come up with fake names and addresses that are somewhat close to the real data. I think that once you have a good statistical model, you could export it and outsource testing more easily without compromising confidential information. If something like average salary were considered confidential, it could be skewed as a kind of obfuscation step.
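A minimal sketch of the numeric-column part of this idea, assuming a simple Gaussian fit per column. The salary figures and the `skew` obfuscation knob are invented for illustration; real columns would of course need distribution checks first:

```python
import random
import statistics

def fit_numeric(values):
    """Estimate the mean and standard deviation of a real column."""
    return statistics.mean(values), statistics.stdev(values)

def fake_numeric(mean, stdev, n, skew=1.0):
    """Sample n fake values from the fitted distribution.
    `skew` deliberately shifts a confidential average (e.g. salary)
    as a crude obfuscation step."""
    return [random.gauss(mean * skew, stdev) for _ in range(n)]

real_salaries = [52000, 61000, 58000, 49500, 70000, 64000]
mean, stdev = fit_numeric(real_salaries)
fakes = fake_numeric(mean, stdev, 100, skew=1.2)
```

Only the two fitted parameters (and the skew factor) would need to leave the building, not any real rows, which is what makes the outsourced-testing angle plausible.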


That sounds cool. The term for what you want to prevent is "statistical disclosure" (http://scholar.google.com.au/scholar?q=statistical+disclosur...).

That is: How do I release interesting and useful data, while still preventing the disclosure of important information such as salary, or SSN for those of you in the US.


I built one of these quite a while ago to help a friend pick his baby's name.

https://github.com/harperreed/Baby-Chains

The names it generates are hilarious.

Markov chains are wonderful things. A good Markov chain bot will really spice up a company IRC channel.


Heh, glad to know I'm not the only one! Seems like you have a far larger set of initial names, very cool.


Cool! I'll have to remember this when we can't pick a name for our next baby. A little hit or miss, but for a geeky name it's better than little Bobby Tables...


Ahh, Markov chains. I pulled all my favourite quotes into a text file and used them to generate new quotes. Some nice ones:

What you do speaks so loudly that I drink this beer.

Premature optimisation is the immemorial refuge of the most troubled mind.

Collective judgement of new ideas is so often wrong that it picks up confidence as it appears.

How many seconds are there in a cellar on a rainy day?

The music business is a higher revelation than philosophy.

Art is a sign of intelligence.

My choice early in life is to have dinner.


Slightly off-topic: Does anyone know of tree(?) graphs of Markov chains for certain bodies of words? E.g. the probability of following characters for a certain book, showing only the more probable choices. That might look pretty cool.


Just so you know, I'm looking at this from my Android 4.1 stock browser and the entire page is blinking on and off randomly like some kind of joke. I can't scroll down because it seems as if it's constantly reloading itself.


Appreciate the heads-up. I think this is partly third-party stuff loading and partly the lack of pagination. It's on my to-do list to fix; thank you.


What does it have to do with "Jekyll" in the title?! It reads "Weekend Hack: A Markov Baby Name Generator in Jekyll"


What I wish it was: honeypot for mindless RT-ers.

What it actually was: I have no idea. Fixed.


I built a bunch of stuff around this for domain names. I should open source it.


Cool! I was thinking of doing that. Where'd you get the startup names database, CrunchBase?


I had a bunch of sources. None with really good results, so it didn't really matter... I should work on it more. I have a much more expansive idea involving big relationship graphs and whatnot, but actual startup is taking up too much time. Someday.


I did something like that with CrunchBase at https://github.com/darius/languagetoys (companynames.py)


I am Weenan, son of Mindron.


Why is it called a "baby" name generator and not just a "name generator"?


Because the vast majority of people who need to choose a name are doing so for a baby, not someone older. Obviously in this case it's not really designed to be a useful tool, but it's still based on the idea of helping parents pick a name for their new child.


Here are a few "gems" from the twitter feed:

    C would make an awesome boy's name
    Ieahaholijayson would make an awesome boy's name.
    Thinking about a boy's name? how about Chosowex?
Pretty damn horrible.


Heh, agreed - it's definitely a mixed bag; that's what makes it fun. I've since limited it to names of at least three characters; that said, IMHO "C" is a pretty sweet name.

Some of the names aren't so bad, though: Marin, Harker, Gacon... Once in a while it'll generate a real name by accident. Those are my favorite.


Harker Chosowex Gacon, at your service!



