Japanese Spam Analysis (or Artificially Intelligent Teaching by Statistics)
by Javantea
Sept 25, 2008
Japanese AI version 0.1
[sig]
Introduction
Japanese spam is a good sample of text in the Japanese language. It is also a
very good tool for understanding common Japanese speech. Most spam is designed
to trick the recipient into replying by e-mail or visiting a site. Unlike
English spam, most Japanese spam is extremely well written and targeted at a
net-savvy, quite well-educated Japanese audience. Also, since spam filters in
Japan can pick out words much more quickly (since Japanese uses kanji),
spammers are using higher quality spam generators.
It is my opinion that Japan is ahead of the curve in spam, so I expect that
English spam will head in the same direction rather than vice versa. On the
other hand, Japanese spam lacks variety because the market is small compared
to the US (which is itself small compared to any other advertising medium:
websites, snail mail, magazines, TV, newspapers, even billboards). Variety is
traded off against quality in spam: the higher the quality, the lower the
variety, and Japanese spam sits at the very high-quality end of that
trade-off. As spam grows as a market, it is quite likely that quality and
variety will both increase. Quantity, on the other hand, is unlikely to grow,
since the effectiveness of the medium decreases far faster than quantity
grows. As an internet researcher, I am quite interested in the process and
design of different modes of communication, especially ones that invoke
emotions (usually anger and frustration) so readily. Many developers and
researchers (myself included) spend huge amounts of time, energy, and money on
the topic of spam, so contributing to the common base of knowledge on this
subject, however little, is an ever-present goal. Today, however, I will be
using spam as a tool to learn about other communication topics rather than
researching it directly. I hope the readers approve even if they may be more
interested in something completely different.
I will use histograms of Japanese spam (counting each Unicode character on its
own) in this essay to teach the reader about the Japanese language. I have a
fairly good background in Japanese and can speak basic, almost conversational
Japanese (at very low speed). I have 4 years of experience speaking Japanese
and I spent one month in Japan. One of my hobbies there was decoding their
often cryptic advertisements.
You could almost consider this essay a work of Quantitative Linguistics, though
I would probably be hard pressed to find a peer to review my paper. If you'd
like to peer review this paper or publish it in your journal (whatever its
title), feel free to e-mail me or leave a comment.
An excellent paper on decoding real Japanese texts can be found here:
http://www.cs.cornell.edu/home/llee/papers/segmentjnle.pdf
It was obviously written with English-speaking readers studying Japanese texts
in mind. The paper's references should be my bibliography, but they aren't
because I have no realistic way of reading this wealth of information on
Japanese language processing.
Method
The method of creating a database and histogram of spam can be reproduced by
anyone who has a directory full of flat-file spam e-mails, some of them
Japanese. If you do not have any Japanese spam, I noticed that creating a
website with lots of Japanese characters coincided with a rather large
increase in Japanese spam. I expect that Japanese spammers are probably using
search engines to find websites with Japanese keywords and manually scraping
them for e-mail addresses. The tools I used to create this database are normal
GNU OS utilities as well as custom Python scripts.
# japanese_spam2.sh
# by Javantea
# Oct 26, 2008
# Automatically analyze a directory full of spam
SPAMDIR=~/Mail/.inbox.directory/spam1/
TODAY=$(date +%Y%m%d)
# Put all names of Japanese spam into a text file.
grep -r -i '^ *charset="SHIFT_JIS"' $SPAMDIR > japanese_spam1.txt
grep -r -i '^ *charset="iso-2022-jp"' $SPAMDIR >> japanese_spam1.txt
grep -r -i '^ *charset=utf-8' $SPAMDIR >> japanese_spam1.txt
grep -r -i '^ *charset="utf-8"' $SPAMDIR >> japanese_spam1.txt
# Replace colon (:) with space.
tr ':' ' ' < japanese_spam1.txt > japanese_spam1a.txt
# Create a directory and copy all spams into it.
mkdir spam$TODAY/; awk '{ print $1; }' japanese_spam1a.txt | xargs cp -t spam$TODAY/
# Show the user what you've done.
ls spam$TODAY/
# (Optional) Make a histogram of each file separately for later use.
cd spam$TODAY/
A=$(find . -name '[0-9]*')
cd ..
mkdir spam$TODAY/histogram/
for file in $A; do
python parse_japanese_email1.py --histogram "spam$TODAY/$file" > "spam$TODAY/histogram/$file"
done
# Make a total histogram of all spam emails.
find spam$TODAY/ -name '[0-9]*' | xargs python parse_japanese_email1.py > spam$TODAY/histogram/tot_hist3.txt
# Turn the histogram into a readable utf8 document.
python parse_japanese_email1.py --pwnhistogram spam$TODAY/histogram/tot_hist3.txt > spam$TODAY/histogram/tot_hist3_utf8.txt
# Sort the histogram by count descending.
sort -k 2 -n -r < spam$TODAY/histogram/tot_hist3_utf8.txt > spam$TODAY/histogram/tot_hist3_utf8_count.txt
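The script parse_japanese_email1.py is not reproduced in this essay. As a rough
sketch of the per-character counting step only, assuming each message has
already been decoded to a Unicode string, a histogram could be built with
collections.Counter. The names char_histogram and format_histogram are
hypothetical and are not taken from the original script.

```python
from collections import Counter


def char_histogram(texts):
    """Count every Unicode character across an iterable of decoded strings."""
    counts = Counter()
    for text in texts:
        counts.update(text)  # a str is iterated one character at a time
    return counts


def format_histogram(counts):
    """Return 'char: count' lines, most frequent first (ties by code point)."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ["%s: %d" % (ch, n) for ch, n in ordered]


if __name__ == "__main__":
    # Two tiny made-up strings, only to show the output shape.
    sample = ["私の名前", "私の本"]
    for line in format_histogram(char_histogram(sample)):
        print(line)
```

The "char: count" output shape mirrors the entries quoted in the Analysis
section, so the same sort step would apply to it.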
Data
Raw Output in HTML format
[uploads/tot_hist3_utf8_count.html]
Raw Output in Text format
[uploads/tot_hist3_utf8_count.txt]
Analysis
The most common non-ASCII character by far is a wide space character (8920).
It is most likely typed by Japanese input methods, which produce a full-width
space in place of the ASCII one. URL-encoded as UTF-8 it is %E3%80%80, which
is U+3000 IDEOGRAPHIC SPACE, HTML entity &#12288;.
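The three representations of the wide space can be checked with the Python
standard library. This is only a small verification sketch; the variable names
are mine.

```python
import unicodedata
from urllib.parse import quote

WIDE_SPACE = "\u3000"

name = unicodedata.name(WIDE_SPACE)          # official Unicode character name
url_encoded = quote(WIDE_SPACE)              # percent-encodes the UTF-8 bytes
html_reference = "&#%d;" % ord(WIDE_SPACE)   # decimal numeric character reference

print(name)            # IDEOGRAPHIC SPACE
print(url_encoded)     # %E3%80%80
print(html_reference)  # &#12288;
```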
の: 5000
By coincidence, there are exactly 5000 instances of the hiragana syllable no.
It is the particle that marks ownership or attribution, so it is very common,
especially in spam e-mail. An example I would commonly use is
"私 の 名前 は ジャワンテー" (Watashi no namae wa Jawantee translates to "My name
is Javantea").
い: 4230
The hiragana syllable i is quite common in Japanese text. Histograms of
Japanese text should follow this pattern as well. A common use of i is:
"良い 子供" (ii kodomo translates to "Good Child/Children").
て: 3445
The hiragana syllable te is quite common in Japanese text, especially e-mail
spam. It is used as a verb ending (the te form) for commands and requests like
do this, do that, etc. A common use of te for a command is:
"ゆっくり 言って下さい" (yukkuri itte kudasai translates to "Please speak
slowly").
で: 3062
The hiragana syllable de is quite common in Japanese text since it is used for
です (desu) and でした (deshita), which are the to be verbs (is, am, was, were,
will be, should be, etc). It is also used for では (dewa translates to "then",
"so", "well then", etc), でも (demo is the conjunction but), and by itself for
the prepositions at and by. A common use of de is "君 は 奇麗 です" (kimi wa
kirei desu translates to "You are pretty"). An additional note about the above
sentence: Japanese quite often omits words and phrases that can be inferred
from the context of the current discussion, so "君 は" can be omitted if the
previous sentence referenced the other person. Using de as the at preposition,
we can say "これ を アーケード で 買いました" (kore o a-ke-do de kaimashita
translates literally to "This arcade at bought", meaning "I bought this at the
arcade"). [5]
し: 2997
The hiragana syllable shi is quite common in Japanese text since it is used for
でした (deshita translates to "was") and でしょう (deshou translates to "I think"
or "probably", anything where the speaker is uncertain). This character is also
used in a variety of words commonly useful in spam: 一緒に参加してみない?
(issho ni sanka shite minai? translates roughly to "Won't you try joining in
together?").
A far less common use of shi in an e-mail subject is this gem:
スグ生中出しOKです (Sugu nama nakadashi OK desu does not translate well).
?: 2919
Note that the question mark is more common than many syllables, though less
common than the ones above. This means that questions occur quite often in
this spam.
す: 2817
The hiragana syllable su is used for です (desu), see above, and ます (masu),
which is the polite present tense ending for verbs. Since present tense is used
quite often and verbs occur in all sentences, it is very common. A use from
spam I have here is: "成功率13%は完全保証します。" (seikouritsu 13% wa
kanzen hoshou shimasu translates to "Success Rate 13% is Perfect Guaranteed.").
ま: 2437
The hiragana syllable ma is used for ます (masu), see above, and ました (mashita),
which is the ending for past tense verbs. It is also used in the negative
present tense ending ません (masen) and the negative past tense ending
ませんでした (masen deshita). A usage in spam is: "理恵子の悩みを聞いてもらえませんか?"
(Rieko no nayami o kiite moraemasen ka? translates literally to "Rieko's
troubles listen won't you?" meaning "Won't you listen to Rieko's troubles?")
。: 2250
This character is the Japanese period. It is used for statements, but not
always, since it's a bit of a hack on the language. The only interesting thing
about this statistic is how much it gets used compared to the ASCII period,
which appears 11391 times (a 5:1 ratio). This means that in my dataset, periods
used in headers (which were sadly not stripped) greatly outweigh Japanese
periods.
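The header periods mentioned above could be kept out of the count by decoding
only the message body. The following is a minimal sketch using the standard
email module, assuming each spam is stored as one raw message per file. The
helper body_text is hypothetical and is not part of the original scripts.

```python
from email import message_from_bytes


def body_text(raw_bytes):
    """Return the decoded text parts of a message, without any header lines."""
    msg = message_from_bytes(raw_bytes)
    chunks = []
    for part in msg.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "ascii"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back rather than abort the run.
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "".join(chunks)
```

Counting characters on the result of body_text instead of the raw file would
leave out addresses, dates, and other header fields that are full of ASCII
periods.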
な: 2170
The hiragana syllable na is used for spelling out ない (nai, written 無い with
kanji, translates to negative, without, nothing) as well as a multitude of
other uses. The first two uses I saw were: オバサン達なので... (Obasan-tachi na
node translates literally to "Older women because it is...", that is, "because
they are older women") and "簡単な" (kantan na translates to "simple").
に: 2164
Particle used like the prepositions in, on, to, and from, as well as other uses.
か: 2100
Question particle; also part of ですから (desukara) and だから (dakara), as well
as many other uses.
る: 2088
Verb ending for the dictionary form.
>: 2063
Used for pointing and layout.
>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<
を: 2042
Particle: the direct object marker, sometimes used like the preposition from.
と: 1897
Particle meaning and.
<: 1885
Used for pointing and layout.
は: 1810
Though pronounced ha most of the time, as a particle it is pronounced wa. It is
the topic marker particle. A common use is "私はアメリカ人です" (watashi wa
Amerikajin desu translates to "I am an American person.")
た: 1705
tab: 1649
*: 1630
ら: 1597
が: 1591
も: 1373
っ: 1327
お: 1284
り: 1217
会: 1080
The first actual kanji is more common than quite a few syllables, most notably
あ (a). This kanji means meeting and/or understanding. Sadly, it is not actually
the most popular kanji in the Japanese language, which means our data set is
definitely skewed toward words used in e-mail spam. It is also possible that
it is skewed toward Japanese advertising e-mail or friendly e-mail, but more
likely this kanji is simply part of a popular word. A use I saw was 出会
(shukkai translates to "encounter"); you can guess the context.
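Ranking kanji separately from kana requires telling the scripts apart. One way
to sketch this is by Unicode block, assuming only the three most common blocks
matter (Hiragana U+3040 to U+309F, Katakana U+30A0 to U+30FF, CJK Unified
Ideographs U+4E00 to U+9FFF). The functions script_of and top_kanji are
hypothetical helpers, and the counts below are the ones quoted in this section.

```python
def script_of(ch):
    """Classify one character by Unicode block (common blocks only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def top_kanji(counts, n=10):
    """Return the n most frequent kanji from a {char: count} mapping."""
    kanji = [(ch, c) for ch, c in counts.items() if script_of(ch) == "kanji"]
    return sorted(kanji, key=lambda kv: -kv[1])[:n]


if __name__ == "__main__":
    counts = {"の": 5000, "り": 1217, "会": 1080, "れ": 961,
              "性": 944, "あ": 892, "女": 860}
    print(top_kanji(counts, 3))
```

Rare kanji outside the main block (extension blocks, compatibility
ideographs) would be reported as "other" by this sketch.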
れ: 961
性: 944
The second most popular kanji is rather telling: sei is often used for gender
and sexual topics. The word for sexual intercourse (性交) probably accounts for
nearly half of these mentions. Note that not far down the list, the second
kanji in that word (交) is mentioned 509 times, over half as many times as sei.
Non-sexual uses of this kanji are possible and common, since it expresses a
very common idea: for example 性行 means "character and conduct", 性質 means
"nature", "property", or "disposition", 陰性 means "negative", and many other
examples exist. Foreigners learning useful words for paperwork on a trip to
Japan should memorize this kanji as well as the various words made with it.
Scientists, especially physicists and engineers, need to know this kanji
because it is used quite often as a suffix to describe properties such as:
感受性 (sensitivity), 揮発性 (volatility), 偽性 (pseudo), 現実性 (feasibility),
合法性 (lawfulness), 水溶性 (water solubility), 潜伏性 (latency), and so forth.
This kanji is very versatile and cannot be added to Bayesian spam filters
without connecting it with the specific trailing characters that make its
meaning sexual. A common use of this in spam is "女性が男性へ謝礼を支払う・・・"
(josei ga dansei e sharei o shiharau translates to "A woman pays a reward to a
man").
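The guess that 性交 accounts for nearly half of the 性 mentions could be
measured, and a filter could be given two-character tokens instead of the lone
kanji, by counting the character pairs that contain it. This is only a sketch
under the assumption that the spam bodies are available as decoded strings; the
function neighbors is hypothetical and the sample strings are illustrative, not
results from my dataset.

```python
from collections import Counter


def neighbors(texts, target):
    """Count the two-character sequences that contain `target`."""
    pairs = Counter()
    for text in texts:
        # zip(text, text[1:]) walks over every adjacent character pair.
        for first, second in zip(text, text[1:]):
            if target in (first, second):
                pairs[first + second] += 1
    return pairs


if __name__ == "__main__":
    sample = ["女性が男性へ謝礼を支払う", "水溶性の性質"]
    for pair, n in neighbors(sample, "性").most_common():
        print(pair, n)
```

Dividing the count for one pair by the single-character count of 性 would give
the share of mentions that belong to that word.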
!: 899
The exclamation mark is used quite often in spam. No other explanation necessary.
あ: 892
女: 860
The third most popular kanji, onna ("woman"), gives away the source to any
student of Japanese. No source of information anywhere in the entire body of
works in human history has ever discussed women with this frequency. Only an
author of post-modern poetry who pastes the character a hundred times in a row
comes close to using this kanji at this frequency. It accounts for
860/606463 = 0.14% of all e-mail content. By comparison, whitespace (including
the ideographic space and tab) accounts for 63351/606463 = 10%. That means for
every 74 whitespace characters, there is one mention of a woman. I think this
proves more about spam than it does about the human race, though.
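The percentages above can be rechecked from the three counts given in the text.
The constant names are mine; the numbers are the ones quoted above.

```python
TOTAL_CHARS = 606463   # all characters counted in the dataset
WHITESPACE = 63351     # includes the ideographic space and tab
ONNA = 860             # occurrences of 女

onna_share = ONNA / TOTAL_CHARS
whitespace_share = WHITESPACE / TOTAL_CHARS
whitespace_per_onna = WHITESPACE / ONNA

print("%.2f%%" % (100 * onna_share))        # 0.14%
print("%.1f%%" % (100 * whitespace_share))  # 10.4%
print("%.1f" % whitespace_per_onna)         # 73.7
```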
方: 631
The fourth most popular kanji is a bit of a surprise: kata means "person". It
has many uses other than person of course, such as hou, "side". Compounds
often use the hou reading or one of the many other readings. A very popular
word, 彼方 (achira translates to "there"), uses another reading of the kanji.
出: 569
The fifth most popular kanji is de meaning "outflow/coming (going) out".
Although the literal meaning is similar to a popular English spam word, the
most popular use of 出 is as part of the very common verb 出来る (できる dekiru
means "to be able"). A use in a spam subject is: ご登録準備が出来ました
(ごとうろくじゅんびができました gotouroku junbi ga dekimashita literally means
"Entry ready able", that is, "Your registration is ready").
交: 509
The sixth most popular kanji is maji meaning "trade or mix". A use in spam is
全国47都道府県・中高年肉体交流の会 which translates to
"Meeting for nationwide (47 prefectures) middle-aged to elderly physical
interchange".
無: 483
The seventh most popular kanji nai meaning "no/nothing/without" is quite useful
because it's quite common in Japanese signs and advertising.
人: 435
The eighth most popular kanji nin means "person". It is used quite often in
Japanese signs and communication.
番: 416
The ninth most popular kanji ban is used as a counter for position. For example
一番 (ichiban meaning "best/first") is used quite often in anime and film. A
use in spam is 番組スタッフが明かす! (ばんぐみスタッフがあかす! bangumi sutaffu
ga akasu! which means "The TV program staff reveal!"). In this case ban is used
in the noun bangumi, which means TV program.
Conclusion
I am very pleased with the results of this research. Using a seemingly useless
set of data, I was able to perform statistical analysis and retrieve
interesting and quite useful information from it. The data set is very large
and continues to grow daily, so analysis can actually continue as a permanent
project.
I am very interested in continuing this research for projects such as my kanji
handwriting analysis project and for amateur manga translation projects.
Projects such as Kanjilish and the Unicode Fuzzer are also very interesting
projects that would definitely benefit from the data in this project.
Bibliography
[5] Japanese For Busy People I, Kana Version, page 66.
If you are interested in analyzing Japanese spam or other languages using my
research methods, feel free to contact me.