Monday, October 16, 2006

More on the horoscope remix project...

So my plan so far for remixing horoscopes doesn't seem to be going so well. It went like this:

- Take in the text or texts to be mixed and tag them, so you know what every word's part of speech is.
- Shuffle all the words together randomly.
- Do something like simulated annealing to hill-climb towards a more sensible output text: at every step, swap two words somewhere in the shuffled text, and keep the swap if the new series of three words around each swapped position looks more likely (according to our transition-probability model) than the old one -- and sometimes keep a bad swap anyway, probabilistically, to avoid getting stuck in local maxima. The transition-probability model works in terms of parts of speech ("tags", to those hip to the NLP lingo), and we learned those tables by tagging a previous corpus... There's a rough sketch of this step just after the list.
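
For the curious, the swap-and-score step looks roughly like this. It's a sketch rather than my actual code: the tag-trigram table gets passed in as a trigram_prob(t1, t2, t3) function, and the temperature schedule is pulled out of the air.

```python
import math
import random

def local_score(tagged, positions, trigram_prob):
    """Sum log-probabilities of every tag trigram touching the given positions."""
    score = 0.0
    for i in positions:
        for start in range(i - 2, i + 1):  # the three trigrams that include position i
            if start >= 0 and start + 2 < len(tagged):
                t1, t2, t3 = (tag for _, tag in tagged[start:start + 3])
                score += math.log(trigram_prob(t1, t2, t3) + 1e-12)
    return score

def anneal(tagged, trigram_prob, steps=100000, temp=2.0, cooling=0.9999):
    """tagged is a shuffled list of (word, tag) pairs; returns the remixed text."""
    tagged = list(tagged)
    for _ in range(steps):
        i, j = random.sample(range(len(tagged)), 2)
        before = local_score(tagged, (i, j), trigram_prob)
        tagged[i], tagged[j] = tagged[j], tagged[i]
        after = local_score(tagged, (i, j), trigram_prob)
        # keep a worse arrangement with a probability that shrinks as the
        # temperature cools -- the "sometimes make bad choices" part
        if after < before and random.random() >= math.exp((after - before) / temp):
            tagged[i], tagged[j] = tagged[j], tagged[i]  # undo the swap
        temp *= cooling
    return " ".join(word for word, _ in tagged)
```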

I'm not quite sure why this isn't working. It might be that trigram transition probabilities (on the tags) just don't capture enough structure to get coherent sentences -- coherent sentences being exactly what this method is failing to produce.

I'm thinking about what else I could do. Perhaps I could use longer n-grams in the model -- or maybe I should bring in a parser and reward the hill-climbing when it produces longer and longer parsable sentence chunks (there's a rough sketch of that scoring idea just below). The other possibility is that my tagger isn't as accurate as I think it is... it could be mislabeling more words than expected (and it's expected to mislabel a bunch of them).
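
If I go the parser route, the reward could look something like this, using NLTK's RegexpParser with a toy chunk grammar I haven't actually written or tuned yet -- the score is just the fraction of words the chunker manages to put inside a chunk:

```python
import nltk

# A toy shallow-parsing grammar -- a stand-in, not something I've tuned.
CHUNKER = nltk.RegexpParser(r"""
    NP: {<DT>?<JJ>*<NN.*>+}      # a crude noun phrase
    VP: {<MD>?<VB.*>+<RB>?}      # a crude verb group
""")

def chunk_reward(tagged):
    """Fraction of (word, tag) pairs that end up inside some recognised chunk."""
    tree = CHUNKER.parse(tagged)
    covered = sum(len(subtree) for subtree in tree
                  if isinstance(subtree, nltk.Tree))
    return covered / max(len(tagged), 1)
```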

The problem is totally not that I'm training my models on James Joyce.

5 comments:

Anonymous said...

I'm curious as to why you didn't go with the more common Markov model approach?

Unknown said...

*nods* Well, that's sort of what I did -- just in terms of parts of speech. The usual Markov-chain-text-generator thing starts with the first word, then picks subsequent words until it finds the "END" token... that'd probably work pretty well here, actually...

I wanted to do a cut-up that used all the words -- but it'd almost certainly work better if I'm okay with leaving some out!

(thanks!)

Unknown said...

By which I mean -- "next I'll try a different approach that's more like the usual Markov-chain generators, but taking into account only tags instead of individual words, because you had me think about it a little harder".
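
Something like this sketch, I mean -- the tag-transition table and the tag-indexed word list are stand-ins for things I'd still have to build from the tagged texts:

```python
import random

# Walk a Markov chain over tags, filling each slot with a horoscope word
# that carries that tag.  Assumed inputs (stand-ins, not built yet):
#   tag_bigram   -- {tag: {next_tag: probability}}
#   words_by_tag -- {tag: [words from the horoscopes]}
def generate(tag_bigram, words_by_tag, start_tag="START", end_tag="END", max_len=40):
    out, tag = [], start_tag
    for _ in range(max_len):
        dist = tag_bigram.get(tag)
        if not dist:
            break
        tag = random.choices(list(dist), weights=list(dist.values()))[0]
        if tag == end_tag or not words_by_tag.get(tag):
            break
        # drawing with replacement for now, so some words get reused and
        # others left out -- the trade-off mentioned above
        out.append(random.choice(words_by_tag[tag]))
    return " ".join(out)
```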

Thank'y kindly! :)

Anonymous said...

You could use a set, remove elements when you use them and then restart a chain with the remaining words. Maybe...

Bluebottle said...

Trigram probabilities with a small corpus aren't going to give you enough information to go on, and n-grams with higher n would be even worse -- how about dropping back down to bigram probabilities instead?