zyaga of House Turtles says...
Moonstruck said...
That's an AWESOME idea! And after some searching...I think I may have found something.
http://neon.niederlandistik.fu-berlin.de/en/textstat/
TextStat. I think it's meant for use in English, but it worked in Spanish, too. I tried it with Japanese and it worked, but I'm not sure how accurate it was. Anyone with better Japanese knowledge than me want to test it out and let us know how accurate it is? I just found a Japanese story online and copied it into a Word document and used that. (To be specific, I used this:
http://www.firegrubs.com/images/chokoch ... le0027.pdf)
TextStat, like many other tools, only works with documents where the Japanese text has spaces between the words. However, MeCab is a Japanese text parser, and it has its own Python module. I've got a quick example working that will parse Japanese from a txt file (I imagine it could be upgraded to parse text from a PDF directly as well), and then do 1 of 2 things:
1) Count everything up and dump the results to a text file listing every word along with its frequency (highest at top, lowest at bottom).
2) Save a text file simply containing all of the words, separated by spaces.
Option 2 would allow you to use that program you just found.
However, Option 1 would allow me (or anyone, if we want to turn this into an actual group project; I'd be glad to share the source) to customize the results in any way we see fit. For example, I have already implemented a config file with an "exclude" list, so that we can exclude things such as は, を, も, 。, etc.
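For anyone curious how the two options might look in Python, here is a minimal sketch. It is not the actual jcount source. It assumes the MeCab Python binding is installed and uses its "-Owakati" output mode, and the names EXCLUDE, mecab_tokenize, count_words and join_words are made up for illustration. The exclude list is hard-coded here instead of being read from a config file.

```python
from collections import Counter

# Hypothetical exclude list; the real script reads its list from a config file.
EXCLUDE = {"は", "を", "も", "。"}


def mecab_tokenize(text):
    """Split Japanese text into words using MeCab's space-separated (wakati) output."""
    import MeCab  # imported here so the rest of the sketch runs without MeCab installed

    tagger = MeCab.Tagger("-Owakati")
    return tagger.parse(text).split()


def count_words(tokens, exclude=EXCLUDE):
    """Option 1: return (word, frequency) pairs, most frequent first."""
    counts = Counter(t for t in tokens if t not in exclude)
    return counts.most_common()


def join_words(tokens):
    """Option 2: return the words separated by spaces, for tools that need spaces."""
    return " ".join(tokens)


# Example use (assumes a UTF-8 file named jap.txt):
#     with open("jap.txt", encoding="utf-8") as f:
#         tokens = mecab_tokenize(f.read())
#     for word, freq in count_words(tokens):
#         print(word, freq)
```

Keeping the tokenizer separate from the counting means the counting part can be tried with any list of words, even on a machine without MeCab.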
You can download my example script here:
https://www.dropbox.com/s/6u57vkvno9ko3kk/jcount.zip
1) Extract files
2) Install mecab-0.994.exe (make sure this installs to the default location)
3) Open Command Prompt (cmd)
4) cd c:\whatever\directory\jcount
5) Run: jcount.exe -i jap.txt
6) Look at results.txt
It's anything but polished. I just threw it together this morning.
You can also run jcount.exe with no options to see the help.
This will count the words: jcount.exe -i jap.txt -c
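As a rough idea of how flags like these could be wired up in Python, here is a sketch using argparse. This is an assumption about the structure, not the real jcount code: only the short flags -i and -c come from this post, while the long names, help text, and the build_parser and parse_args functions are invented for illustration.

```python
import argparse


def build_parser():
    # -i and -c are the flags mentioned in the post; everything else is assumed.
    parser = argparse.ArgumentParser(prog="jcount")
    parser.add_argument("-i", "--input", help="path to a Japanese text file")
    parser.add_argument(
        "-c",
        "--count",
        action="store_true",
        help="count the words instead of only splitting them",
    )
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.input is None:
        # No input file given: show the help text.
        parser.print_help()
    return args
```

Calling parse_args(["-i", "jap.txt", "-c"]) gives an object with input set to "jap.txt" and count set to True.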
I repeat, if you do not install the mecab exe first it will NOT work.
(Even then it may not work, as I haven't tested it on someone else's machine yet. :D)
It's an EXE compiled from Python code, so yeah... it may have dependencies I don't know about.
Anyway, if someone tries it out, can you please let me know if it works for you? ^^;;