
getting a word frequency list from ebook files?

Posted: Mon Oct 29, 2012 5:36 pm
by Whistler
This guy is learning English, and to help with his reading he builds word lists from the words in each book: http://blog.rinik.net/hacking-language

Over at WaniKani, we were thinking this would be awesome to do with Japanese http://www.wanikani.com/chat/kanji-and-japanese/721 (or your learning language of choice). Do any of you know how that would work?

Re: getting a word frequency list from ebook files?

Posted: Tue Oct 30, 2012 6:15 am
by Katya
How it would work . . . in terms of programming?

Re: getting a word frequency list from ebook files?

Posted: Tue Oct 30, 2012 10:51 am
by Whistler
Yes. For English it seems a lot easier, but I don't really know how Japanese characters are encoded in ebooks, and since there aren't any spaces you have to figure out where the words are... It looks like some people are brainstorming over at WaniKani; I just thought some of you would find it an interesting problem as well.

Re: getting a word frequency list from ebook files?

Posted: Tue Oct 30, 2012 8:10 pm
by Katya
Ah. Yeah, it does sound interesting, but I can't read the Japanese forum without registering on the site and I'm not motivated enough to do so.

Re: getting a word frequency list from ebook files?

Posted: Tue Oct 30, 2012 8:58 pm
by Whistler
well, here's the most relevant post, if anyone else was curious about a possible solution to this problem:
zyaga of House Turtles says...
Moonstruck said...

That's an AWESOME idea! And after some searching...I think I may have found something.

http://neon.niederlandistik.fu-berlin.de/en/textstat/

TextStat. I think it's meant for use in English, but it worked in Spanish, too. I tried it with Japanese and it worked, but I'm not sure how accurate it was. Anyone with better Japanese knowledge than me want to test it out and let us know how accurate it is? I just found a Japanese story online and copied it into a Word document and used that. (To be specific, I used this: http://www.firegrubs.com/images/chokoch ... le0027.pdf)
TextStat, like many similar tools, only works with documents where the Japanese text has spaces between the words. However, MeCab is a Japanese text parser, and it has its own Python module. I've got a quick example working that will parse Japanese from a txt file (I imagine it could be upgraded to parse text from a PDF directly as well) and then do one of two things:

1) Count everything up and dump the results to a text file listing every word along with its frequency (highest at top, lowest at bottom).
2) Save a text file containing all of the words separated by spaces.
Option 2 would let you use that program you just found.
However, Option 1 would allow me (or anyone, if we want to turn this into an actual group project - I'd be glad to share the source) to customize the results however we see fit. For example, I've already implemented a config file with an "exclude" list so we can filter out things such as は, を, も, 。, etc.
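In case anyone wants to roll their own version of the two options above, here's a minimal sketch in Python. It assumes MeCab has already segmented the text into tokens (in the real script, MeCab's Python module would produce them from the ebook text); the hardcoded token list below just stands in for that output, and the exclude set mirrors the config-file idea.

```python
from collections import Counter

# Particles and punctuation to ignore, like the "exclude" config list above.
EXCLUDE = {"は", "を", "も", "が", "。", "、"}

# Stand-in for MeCab's output; in the real script these tokens would come
# from running the parser over the ebook text.
tokens = ["猫", "は", "魚", "を", "食べる", "。",
          "犬", "も", "魚", "を", "食べる", "。"]

# Drop excluded tokens before counting.
words = [t for t in tokens if t not in EXCLUDE]

# Option 1: frequency list, highest count first.
freq = Counter(words)
for word, count in freq.most_common():
    print(f"{word}\t{count}")

# Option 2: the same words joined by spaces, for tools like TextStat.
spaced = " ".join(words)
print(spaced)
```

Swapping the hardcoded list for real MeCab output is the only missing piece; everything else (excluding, counting, writing out a spaced file) is just this.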

You can download my example script here: https://www.dropbox.com/s/6u57vkvno9ko3kk/jcount.zip
1) Extract the files
2) Install mecab-0.994.exe (make sure it installs to the default location)
3) Open Command Prompt (cmd)
4) cd c:\whatever\directory\jcount
5) Run: jcount.exe -i jap.txt
6) Look at results.txt
It's anything but polished. I just threw it together this morning.

You can also run jcount.exe with no options to see the help.

This will count the words: jcount.exe -i jap.txt -c

I repeat, if you do not install the mecab exe first it will NOT work.

(Even then it may not work, as I haven't tested it on someone else's machine yet :D )

It's an EXE compiled from Python code, so yeah... it may have dependencies I don't know about.

Anyways, if someone tries it out, can you please let me know if it works for you? ^^;;

Re: getting a word frequency list from ebook files?

Posted: Wed Oct 31, 2012 11:47 am
by Katya
Huh. I hope it works out!