Japanese Spam Analysis (or Artificially Intelligent Teaching by Statistics)

by Javantea
Sept 25, 2008

Introduction

Japanese spam is a good sample of text in the Japanese language. It is also a very good tool for understanding common Japanese speech. Most spam is designed to trick the recipient into e-mailing or visiting a site. Unlike English spam, most Japanese spam is extremely well-written, targeted at the net-savvy and quite well-educated Japanese audience. Also, since spam filters in Japan can pick out words much quicker (since Japan uses Kanji), spammers are using higher quality spam generators.

It is my opinion that Japan is ahead of the curve in spam, so I expect that English spam will head in the direction of Japanese spam instead of vice versa. On the other hand, Japanese spam lacks variety because the market is small compared to the US (which is itself small compared to any other advertising medium: websites, snail mail, magazines, TV, newspapers, even billboards). Variety is a trade-off made for quality in spam: the more quality, the less variety, and Japanese spam is on the very high end of quality vs variety. As spam grows as a market, it is quite likely that quality and variety will both increase. Quantity, on the other hand, is unlikely to grow, considering that the effectiveness of the medium decreases far faster than quantity grows.

As an internet researcher, I am quite interested in the process and design of different modes of communication, especially if they invoke emotions (usually anger and frustration) so readily. Many developers and researchers (including myself) spend huge amounts of time, energy, and money on the topic of spam, so contributing, however little, to the common base of knowledge on this subject is an ever-present goal. Today, however, I will be using spam as a tool to learn about other communication topics rather than researching it directly. I hope the readers approve even if they may be more interested in something completely different.

I will use histograms of Japanese spam (counting each Unicode character on its own) in this essay to teach the reader about the Japanese language. I have a fairly good background in Japanese and can speak basic, almost conversational Japanese (at very low speed). I have 4 years of experience speaking Japanese and I spent one month in Japan. One of my hobbies there was decoding their often cryptic advertisements.

You could almost consider this essay a work of Quantitative Linguistics, though I would probably be hard pressed to find a peer to review my paper. If you'd like to peer review this paper or publish it in your journal (no matter what its title is), feel free to e-mail me or leave a comment.

An excellent paper on decoding real Japanese texts can be found here: http://www.cs.cornell.edu/home/llee/papers/segmentjnle.pdf It was obviously written with English-speaking readers studying Japanese texts in mind. The paper's references should be my bibliography, but they aren't because I have no possibility of reading this wealth of information on Japanese language processing.

Method

The method of creating a database and histogram of spam can be reproduced by anyone who has a directory full of flat-file spam e-mails, some of them Japanese. If you do not have any Japanese spam, I noticed that creating a website with lots of Japanese characters coincided with a rather large increase in Japanese spam. I expect that Japanese spammers are probably using search engines to find websites with Japanese keywords and manually scraping for e-mail addresses. The tools I used to create this database are normal GNU OS utilities as well as custom Python scripts.

# japanese_spam2.sh
# by Javantea
# Oct 26, 2008

# Automatically analyze a directory full of spam

SPAMDIR=~/Mail/.inbox.directory/spam1/
TODAY=$(date +%Y%m%d)

# Put all names of Japanese spam into a text file.
grep -r -i '^ *charset="SHIFT_JIS"' $SPAMDIR > japanese_spam1.txt
grep -r -i '^ *charset="iso-2022-jp"' $SPAMDIR >> japanese_spam1.txt
grep -r -i '^ *charset=utf-8' $SPAMDIR >> japanese_spam1.txt
grep -r -i '^ *charset="utf-8"' $SPAMDIR >> japanese_spam1.txt

# Replace colon (:) with space.
tr ':' ' ' < japanese_spam1.txt > japanese_spam1a.txt

# Create a directory and copy all spams into it.
mkdir spam$TODAY/; awk '{ print $1; }' japanese_spam1a.txt | xargs cp -t spam$TODAY/
# Show the user what you've done.
ls spam$TODAY/

# (Optional) Make a histogram of each file separately for later use.
cd spam$TODAY/
A=$(find . -name '[0-9]*')
cd ..
mkdir spam$TODAY/histogram/
for file in $A; do
	python parse_japanese_email1.py --histogram "spam$TODAY/$file" > "spam$TODAY/histogram/$file"
done

# Make a total histogram of all spam emails (top level only, so the
# per-file histograms in histogram/ are not counted as input).
find spam$TODAY/ -maxdepth 1 -name '[0-9]*' | xargs python parse_japanese_email1.py > spam$TODAY/histogram/tot_hist3.txt

# Turn the histogram into a readable utf8 document.
python parse_japanese_email1.py --pwnhistogram spam$TODAY/histogram/tot_hist3.txt > spam$TODAY/histogram/tot_hist3_utf8.txt

# Sort the histogram by count descending.
sort -k 2 -n -r < spam$TODAY/histogram/tot_hist3_utf8.txt > spam$TODAY/histogram/tot_hist3_utf8_count.txt
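
The script parse_japanese_email1.py is not reproduced in this essay. As a rough idea of what the counting and sorting steps involve, here is a minimal sketch in Python. The function names (char_histogram, format_histogram, read_decoded) are hypothetical and not taken from the original script, and it assumes each file has already been matched to a charset such as the ones grep looks for above.

```python
from collections import Counter


def char_histogram(texts):
    """Count every Unicode character across an iterable of decoded strings."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    return counts


def format_histogram(counts):
    """Return 'character count' lines sorted by count, highest first."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ["%s %d" % (char, count) for char, count in ordered]


def read_decoded(path, charset):
    """Read one spam file; undecodable bytes become U+FFFD instead of raising."""
    with open(path, "rb") as handle:
        return handle.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Tiny in-memory sample instead of real spam files.
    sample = ["私の名前", "名前の"]
    for line in format_histogram(char_histogram(sample)):
        print(line)
```

Sorting inside Python plays the same role as the final sort -k 2 -n -r step; ties are broken by code point here so the output is stable.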

Data

Raw Output in HTML format
[uploads/tot_hist3_utf8_count.html]

Raw Output in Text format
[uploads/tot_hist3_utf8_count.txt]

Analysis

The most common non-ASCII character by far is a wide space character (8920). It is presumably inserted by some program rather than typed on purpose, most likely a Japanese input method, which emits a full-width space when typing in full-width mode. URL-encoded in UTF-8 it is %E3%80%80, which is U+3000 IDEOGRAPHIC SPACE, HTML entity &#x3000;.
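
The name, URL encoding, and numeric HTML entity of the wide space can be checked with the Python standard library. This is only a small verification sketch; the variable name wide_space is my own.

```python
import unicodedata
from urllib.parse import quote

wide_space = "\u3000"

# Official Unicode character name.
print(unicodedata.name(wide_space))    # IDEOGRAPHIC SPACE
# Percent-encoding of the UTF-8 bytes.
print(quote(wide_space))               # %E3%80%80
# Decimal numeric HTML entity (12288 is 0x3000).
print("&#%d;" % ord(wide_space))       # &#12288;
```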

の: 5000 By coincidence, there are 5000 instances of the hiragana syllable no. It is used as a sign of ownership or of attribution, so it is very common, especially in spam e-mail. An example I would commonly use is "私の名前はジャワンテー" (Watashi no namae wa Jawantee translates into "My name is Javantea").

い: 4230 The hiragana syllable i is quite common in Japanese text. Histograms of Japanese text should follow this pattern as well. A common use of i is: "良い子供" (ii kodomo translates to "Good Child/Children").

て: 3445 The hiragana syllable te is quite common in Japanese text, especially e-mail spam. It is used for commands like do this, do that, etc. A common use of te for a command is: "ゆっくり言って下さい" (yukkuri itte kudasai translates to "Please speak slowly").

で: 3062 The hiragana syllable de is quite common in Japanese text since it is used for です (desu) and でした (deshita), which are the "to be" verbs (is, am, was, were, will be, should be, etc). It is also used for では (dewa translates to "then", "so", "well then", etc), でも (demo is the conjunction "but"), and by itself for the prepositions "at" and "by". A common use of de is "君は奇麗です" (kimi wa kirei desu translates to "You are pretty"). An additional note about the above sentence: Japanese quite often omits words and phrases that can be assumed from the context of the current discussion, so "君は" can be omitted if the previous sentence referenced the other person. Using de as the "at" preposition, we can say "これをアーケードで買いました" (kore wo aakeedo de kaimashita translates to "This arcade at bought" meaning "I bought this at the arcade"). [5]

し: 2997 The hiragana syllable shi is quite common in Japanese text since it is used for でした (deshita translates to "was") and でしょう (deshou translates to "I think" or anything where the speaker is uncertain). This character is also used in a variety of words commonly useful in spam: 一緒に参加してみない？ (issho ni sanka shite minai? translates roughly to "Won't you try joining in together?"). A far less common use of shi in an e-mail subject is this gem: スグ生中出しOKです (sugu nama nakadashi OK desu does not translate well).

?: 2919 Note that the question mark is more common than many syllables, but less common than the syllables above. This means that questions occur quite often, but the syllables above occur even more often.

す: 2817 The hiragana syllable su is used for です (desu, see above) and ます (masu), which is a present tense modifier to verbs. Since present tense is used quite often and verbs occur in all sentences, this is quite commonly used. A use from spam I have here is: "成功率１３％は完全保証します。" (seikouritsu 13% wa kanzen hoshou shimasu translates to "Success rate 13% is completely guaranteed.").

ま: 2437 The hiragana syllable ma is used for ます (masu, see above) and ました (mashita), which is a modifier for past tense verbs. It is also used in the modifier for negative present tense verbs, ません (masen), and negative past tense verbs, ませんでした (masen deshita). A usage in spam is: "理恵子の悩みを聞いてもらえませんか？" (Rieko no nayami wo kiite moraemasen ka? translates to "Rieko's troubles listen to won't you?" meaning "Won't you listen to Rieko's troubles?").

。: 2250 This character is the Japanese period. It is used for statements, but not always since it's a bit of a hack on the language. The only interesting thing about this statistic is how much it gets used, and how it compares to the ASCII period at 11391 (a 5:1 ratio). This means that in my dataset, periods used in headers (which were sadly not stripped) greatly outweigh Japanese periods.
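
One way to keep header periods out of the counts would be to parse each message and count only its text body. The following is a minimal sketch using Python's standard email package, not the method used for the data in this essay; the function name body_text is hypothetical, and falling back to UTF-8 for a missing or unknown charset is my own assumption.

```python
from email import message_from_bytes


def body_text(raw_bytes):
    """Return the decoded text parts of a raw message, leaving headers out."""
    message = message_from_bytes(raw_bytes)
    parts = []
    for part in message.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Charset name unknown to Python: fall back to UTF-8.
            parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


if __name__ == "__main__":
    raw = (
        b"Received: from mail.example.com by mx.example.org\r\n"
        b'Content-Type: text/plain; charset="utf-8"\r\n'
        b"\r\n"
    ) + "成功です。".encode("utf-8")
    text = body_text(raw)
    print(text.count("."), text.count("。"))
```

Counting characters in the returned string instead of the raw file would leave the periods in hostnames and header fields out of the histogram.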

な: 2170 The hiragana syllable na is used for spelling out ない (nai, written with kanji as 無い, translates to negative, without, nothing) as well as a multitude of other uses. The first two uses I saw were: オバサン達なので... (obasan-tachi na no de translates to "Because they are middle-aged women...") and "簡単な" (kantan na translates to "simple").

に: 2164 Particle used like the prepositions in, on, to, and from, as well as other uses.

か: 2100 Question particle, also found in ですから (desukara) and だから (dakara), as well as many other uses.

る: 2088 Verb ending for the dictionary form.

>: 2063 Used for pointing and layout. >>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<

を: 2042 Object marker particle; with verbs of leaving it can translate to "from".

と: 1897 Particle meaning "and".

<: 1885 Used for pointing and layout.

は: 1810 Though pronounced ha most of the time, as a particle it is pronounced wa. It is the topic marker particle. A common use is "私はアメリカ人です" (watashi wa Amerikajin desu translates to "I am an American").

た: 1705

tab: 1649

*: 1630

ら: 1597

が: 1591

も: 1373

っ: 1327

お: 1284

り: 1217

会: 1080 The first actual kanji is more common than quite a few syllables, most notably あ (a). This kanji means meeting and/or understanding. Sadly, it is not actually the most popular kanji in the Japanese language, so this means our data set is definitely skewed toward words used in e-mail spam. It is also possible that it could be skewed toward Japanese advertising e-mail or friendly e-mail, but more likely, this kanji is simply a popular word. A use I saw was 出会い (deai translates to "encounter"); you can guess the context.

れ: 961

性: 944 The second most popular kanji is rather telling: sei is often used for gender and sexual topics. The word for sexual intercourse (性交) accounts for probably nearly half of these mentions. Note that not far down the list, the second kanji in that word (交) is mentioned 509 times, over half as many times as sei. Non-sexual uses of this kanji are possible and common, since it expresses a very common idea: for example, 性行 means "character and conduct", 性質 means "nature", "property", or "disposition", 陰性 means "negative", and many other examples exist. Foreigners learning useful words for paperwork on a trip to Japan should memorize this kanji as well as the various words made with it. Scientists, especially physicists and engineers, need to know this kanji because it is used quite often as a suffix to describe properties such as: 感受性 (sensitivity), 揮発性 (volatile), 偽性 (pseudo), 現実性 (feasible), 合法性 (lawfulness), 水溶性 (water-soluble), 潜伏性 (latency), and so forth. This kanji is very versatile and cannot be added to Bayesian spam filters without connecting it with the specific characters following it that make its meaning sexual. A common use of this in spam is "女性が男性へ謝礼を支払う・・・" (josei ga dansei e sharei wo shiharau translates to "A woman pays a reward to a man").
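
To see which characters actually follow 性 in a body of text, and so which compounds a filter would need to tell apart, one could count two-character pairs that start with it. This is a minimal sketch with a hypothetical function name (following_pairs). It only looks at the character after the kanji, so suffix uses such as 水溶性 would need a matching count of preceding characters.

```python
from collections import Counter


def following_pairs(text, target):
    """Count each two-character pair in text that starts with target."""
    counts = Counter()
    for current, following in zip(text, text[1:]):
        if current == target:
            counts[current + following] += 1
    return counts


if __name__ == "__main__":
    sample = "性交、性質、性交"
    for pair, count in following_pairs(sample, "性").most_common():
        print(pair, count)
```

A filter could then weight pairs such as 性交 and 性質 separately instead of treating the lone kanji as a spam signal.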

！: 899 The exclamation is used quite often in spam. No other explanation necessary.

あ: 892

女: 860 The third most popular kanji (onna) gives away the source to any Japanese student, woman. No source of information anywhere in the entire body of works in human history has ever discussed women with this frequency. Only an author of post-modern poetry who pastes the character a hundred times in a row comes close to using this kanji at this frequency. It accounts for 860/606463 = 0.14% of all e-mail content. Comparatively, whitespace (including the ideographic space and tab) accounts for 63351/606463 = 10%. That means for every 74 whitespace, there is one mention of a woman. I think this proves more about spam than it does the human race though.
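
The arithmetic in the paragraph above can be reproduced in a few lines of Python. The counts are the ones quoted in the text; the variable names are my own.

```python
total_chars = 606463        # all characters in the data set
onna_count = 860            # occurrences of the kanji for woman
whitespace_count = 63351    # whitespace, including ideographic space and tab

onna_share = onna_count / total_chars
whitespace_share = whitespace_count / total_chars
whitespace_per_onna = whitespace_count / onna_count

print("%.2f%%" % (100 * onna_share))         # 0.14%
print("%.1f%%" % (100 * whitespace_share))   # 10.4%
print(round(whitespace_per_onna))            # 74
```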

方: 631 The fourth most popular kanji is a bit of a surprise, (kata) means "person". It has many uses other than person of course, such as (hou) "side". Combinations often use the hou form or one of the many other forms. A very popular word, 彼方 (achira translates to "there") uses another form of the kanji.

出: 569 The fifth most popular kanji is de meaning "outflow/coming (going) out". Although the literal meaning is similar to a popular English spam word, the most popular use of 出 is as part of the very common verb 出来る (できる dekiru means "to be able"). A use in a spam subject is: ご登録準備が出来ました (ごとうろくじゅんびができました go-touroku junbi ga dekimashita, literally "Registration preparation able", meaning "Your registration is ready").

交: 509 The sixth most popular kanji is maji meaning "trade or mix". A use in spam is 全国47都道府県・中高年肉体交流の会 which translates to "Meeting of nationwide 47 metropolis and districts middle to elderly-aged physical interchange".

無: 483 The seventh most popular kanji nai meaning "no/nothing/without" is quite useful because it's quite common in Japanese signs and advertising.

人: 435 The eighth most popular kanji nin means "person". It is used quite often in Japanese signs and communication.

番: 416 The ninth most popular kanji ban is used as a counter for position. For example 一番 (ichiban meaning "best/first") is used quite often in anime and film. A use in spam is 番組スタッフが明かす！ (ばんぐみスタッフがあかす！ bangumi staff ga akasu! which means "TV program staff reveal!"). In this case ban is used in the noun bangumi which means TV program.
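
Ranking kanji separately from hiragana, katakana, and punctuation, as done by hand in the list above, can be approximated by checking each character's Unicode block. This is a minimal sketch with hypothetical function names (script_of, top_kanji). It treats only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as kanji and ignores the extension blocks.

```python
def script_of(char):
    """Classify one character by Unicode block (basic blocks only)."""
    code = ord(char)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def top_kanji(counts, limit):
    """Return the most frequent kanji as (character, count) pairs."""
    kanji = [(c, n) for c, n in counts.items() if script_of(c) == "kanji"]
    kanji.sort(key=lambda item: -item[1])
    return kanji[:limit]


if __name__ == "__main__":
    # A few counts quoted in the analysis above.
    counts = {"の": 5000, "会": 1080, "れ": 961, "性": 944, "！": 899, "女": 860}
    print(top_kanji(counts, 3))
```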

Conclusion

I am very pleased with the results of this research. Using a seemingly useless set of data, I was able to perform statistical analysis and retrieve interesting and quite useful information from it. The data set is very large and continues to grow daily, so analysis can continue as a permanent project.

I am very interested in continuing this research for projects such as my kanji handwriting analysis project and amateur manga translation projects. Projects such as Kanjilish and the Unicode Fuzzer would also definitely benefit from the data in this project.