**Applying machine learning to the analysis of text – The Case of “The Spring and Autumn Annals of Master Yan”**

Abraham Meidan, PhD. This paper was submitted to a conference at Renmin University, Beijing, China.

WizSoft

In what follows we present a machine learning technique for analyzing text. The examples refer to the text of “The Spring and Autumn Annals of Master Yan” (hereafter the YZCQ text).

Machine learning is a set of computerized statistical techniques that “learn” by discovering valid patterns in the data. In the standard case the data are saved in a flat file (like an Excel sheet). The rows refer to the records of some population. One column is the field to be explained, the dependent variable. The rest of the columns are the independent variables. The machine learning algorithm is supposed to find a valid model that explains the values of the dependent variable as a function of the values of the independent variables.

As mentioned, the model has to be valid. “Validity” means that when the model is used to predict the values of new records (belonging to the same population that was used for creating the model), the predictions are more accurate than random predictions, or than predictions based only on the frequencies of the various values of the dependent variable. If the predictions fail to meet this standard, the model is said to be the result of coincidental patterns (in professional terminology, the result of overfitting).
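The validity criterion can be illustrated by comparing a model's accuracy on new records with the accuracy of a guess based only on value frequencies. The following is a minimal sketch and not the procedure used in this research; the function names and the toy labels are assumptions made for the example.

```python
from collections import Counter


def baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training value."""
    most_common_value, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == most_common_value)
    return hits / len(test_labels)


def model_accuracy(predictions, test_labels):
    """Share of new records whose predicted value matches the actual one."""
    hits = sum(1 for p, y in zip(predictions, test_labels) if p == y)
    return hits / len(test_labels)


def is_valid(predictions, train_labels, test_labels):
    """A model counts as valid here only if it beats the frequency baseline."""
    return model_accuracy(predictions, test_labels) > baseline_accuracy(
        train_labels, test_labels
    )


# Toy data (hypothetical): 1 = the word exists in the section, 0 = it does not.
train_labels = [0, 0, 0, 1, 0, 1]
test_labels = [0, 1, 0, 1]
predictions = [0, 1, 0, 0]  # what some model predicted for the new records

print(baseline_accuracy(train_labels, test_labels))      # 0.5
print(model_accuracy(predictions, test_labels))          # 0.75
print(is_valid(predictions, train_labels, test_labels))  # True
```

In this toy case the model is right on three of four new records, while the frequency-based guess is right on two, so the model passes the test.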

As mentioned, the data should be saved in a flat file like an Excel sheet. Such data are structured. Text data are usually unstructured: there is no straightforward way to present text in an Excel sheet. Still, one can convert some of the contents of the text into structured data and then apply a machine learning algorithm.

In the present research the question was: Can the existence of certain words in sections of the YZCQ be explained as a function of the existence of other words in these sections?

We selected the following words as the dependent variables:

民

社稷

仁

The independent variables were the rest of the words in the text, but for reasons that will be explained later we restricted the list to the 200 most frequent words.

Since many signs in Chinese have more than one meaning, we created additional flat files where the independent variables were the 200 most frequent pairs of successive signs.
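Extracting pairs of successive signs and keeping the most frequent ones might look as follows. This is a sketch under our own assumptions: the helper names and the three miniature sections are hypothetical, and the real analysis kept the 200 most frequent pairs rather than 2.

```python
from collections import Counter


def successive_pairs(section):
    """Return every pair of adjacent signs in a section, in order."""
    return [section[i:i + 2] for i in range(len(section) - 1)]


def most_frequent_pairs(sections, limit):
    """Count pairs across all sections and keep the `limit` most frequent."""
    counts = Counter()
    for section in sections:
        counts.update(successive_pairs(section))
    return [pair for pair, _ in counts.most_common(limit)]


# Hypothetical miniature corpus; the real text has 217 sections.
sections = ["景公問晏子", "晏子對曰", "景公曰善"]

print(successive_pairs("晏子對曰"))      # ['晏子', '子對', '對曰']
print(most_frequent_pairs(sections, 2))  # ['景公', '晏子']
```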

So we analyzed six tables. Three tables (each one for a different dependent variable) had the following structure:

| Section# | Word1 | Word2 | Word3 | … | … | … | Dependent variable |
|---|---|---|---|---|---|---|---|
| 1 | | | | | | | |
| 2 | | | | | | | |
| 3 | | | | | | | |
| 4 | | | | | | | |
| … | | | | | | | |

Three additional tables referred to pairs of words and had the following structure:

| Section# | Pair of Words1 | Pair of Words2 | Pair of Words3 | … | … | … | Dependent variable |
|---|---|---|---|---|---|---|---|
| 1 | | | | | | | |
| 2 | | | | | | | |
| 3 | | | | | | | |
| 4 | | | | | | | |
| … | | | | | | | |

In each cell the value was either 1, when the word (or pair of words) exists in the section, or 0, when it does not.
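The conversion of sections into such a 0/1 table can be sketched in a few lines. The function `build_presence_table`, its parameters and the miniature corpus are assumptions for illustration only; the sketch handles single signs, with the dependent variable placed in the last column.

```python
from collections import Counter


def build_presence_table(sections, target, limit):
    """Build one 0/1 row per section: frequent signs first, target last."""
    counts = Counter()
    for section in sections:
        counts.update(ch for ch in section if ch != target)
    columns = [sign for sign, _ in counts.most_common(limit)]

    rows = []
    for section in sections:
        row = [1 if sign in section else 0 for sign in columns]
        row.append(1 if target in section else 0)  # dependent variable
        rows.append(row)
    return columns, rows


# Hypothetical miniature corpus standing in for the 217 sections.
sections = ["民財歛", "公問晏", "民財力"]
columns, rows = build_presence_table(sections, target="民", limit=2)

print(columns)
for number, row in enumerate(rows, start=1):
    print(number, row)
```

Each printed row corresponds to one line of the table: the section number, the independent variables, and finally the dependent variable.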

There are various algorithms of machine learning. We selected an algorithm that reveals if-and-only-if rules (necessary and sufficient conditions). You can read about this algorithm in: Abraham Meidan: Wizsoft’s WizWhy, in Oded Maimon, Lior Rokach (Eds.), *The Data Mining and Knowledge Discovery Handbook*, Springer 2005, pp. 1365-1369.

The main reason for using this algorithm is that it produces an easy-to-understand model. With other algorithms the model is either a black box (as with artificial neural networks) or too complex to be easily understood (as with random forests).

The YZCQ text includes 217 sections. This is quite small. The rule of thumb is that in order to avoid overfitting (that is, revealing accidental patterns) the number of rows should be at least 20 times the number of columns. Since we had 200 columns (words or pairs of words), we should have had 4,000 sections rather than 217. This is why we limited the research to the 200 most frequent words (or the 200 most frequent pairs of words): had we included more words or pairs of words, we would have increased the risk of revealing accidental patterns.
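The arithmetic behind this rule of thumb is simple enough to state as code. The function names below are ours, and the ratio of 20 is the rule of thumb quoted above, not a property of any particular algorithm.

```python
def rows_recommended(columns, ratio=20):
    """Rule-of-thumb minimum number of rows for a given number of columns."""
    return columns * ratio


def max_columns(rows, ratio=20):
    """Largest number of columns the rule of thumb allows for `rows` rows."""
    return rows // ratio


print(rows_recommended(200))  # 4000 sections would be needed for 200 columns
print(max_columns(217))       # 10 columns would satisfy the rule for 217 sections
```

Read the other way round, 217 sections would support only about 10 columns under this rule, which shows how far the data fall short and why the rules should be treated with caution.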

**The results**

Below is the analysis of the text when the dependent variable is: 民

**If-and-only-if Rule 1 (out of 2)**

The following conditions explain when

民 *exists*

1) 我 *does not exist*

and 歛 *exists*

2) 樂 *exists*

and 內 *exists*

3) 厚 *does not exist*

and 財 *exists*

4) 得 *exists*

and 和 *exists*

5) 知 *does not exist*

and 怨 *exists*

6) 是 *exists*

and 窮 *exists*

7) 此 *does not exist*

and 力 *exists*

8) 使 *does not exist*

and 禁 *exists*

9) 欲 *exists*

and 危 *exists*

10) 令 *exists*

and 多 *exists*

11) 用 *exists*

and 退 *exists*

12) 如 *exists*

and 姓 *exists*

13) 日 *exists*

and 厚 *exists*

14) 政 *exists*

and 當 *exists*

15) 用 *exists*

and 祿 *exists*

16) 國 *exists*

and 正 *exists*

17) 乎 *does not exist*

and 邪 *exists*

18) 臣 *exists*

and 財 *exists*

19) 請 *does not exist*

and 禁 *exists*

20) 王 *exists*

and 說 *exists*

21) 成 *exists*

and 臺 *exists*

22) 夫 *does not exist*

and 眾 *exists*

23) 王 *exists*

and 長 *exists*

When at least one of the conditions holds, the probability that

民 *exists*

is **0.971** (100 out of 103 cases)

When all the conditions do not hold, the probability that

民 *does not exist*

is **0.930** (106 out of 114 cases)

The total number of cases explained by the set of conditions: **206**

The total number of cases in the data: **217**

Success rate: **0.949** (206 / 217)

The primary probability that:

民 *exists* is **0.498** (108 out of 217 cases)

民 *does not exist* is **0.502** (109 out of 217 cases)

Improvement Factor: **9.818** (min((108*1),(109*1)) / (8*1+3*1))

**If-and-only-if Rule 2 (out of 2)**

The following conditions explain when

民 *does not exist*

1) 何 *does not exist*

and 賜 *exists*

2) 景 *does not exist*

and 善 *exists*

3) 矣 *does not exist*

and 左 *exists*

4) 國 *does not exist*

and 左 *exists*

5) 然 *does not exist*

and 乘 *exists*

6) 景 *does not exist*

and 齊 *exists*

7) 此 *exists*

and 二 *exists*

8) 公 *does not exist*

and 對 *does not exist*

9) 是 *does not exist*

and 去 *exists*

10) 問 *does not exist*

and 辭 *exists*

11) 所 *does not exist*

and 一 *exists*

12) 上 *does not exist*

and 酒 *exists*

13) 見 *exists*

and 入 *exists*

14) 治 *does not exist*

and 殺 *exists*

When at least one of the conditions holds, the probability that

民 *does not exist*

is **0.776** (90 out of 116 cases)

When all the conditions do not hold, the probability that

民 *exists*

is **0.812** (82 out of 101 cases)

The total number of cases explained by the set of conditions: **172**

The total number of cases in the data: **217**

Success rate: **0.793** (172 / 217)

The primary probability that:

民 *does not exist* is **0.502** (109 out of 217 cases)

民 *exists* is **0.498** (108 out of 217 cases)

Improvement Factor: **2.400** (min((108*1),(109*1)) / (26*1+19*1))

Each of these two rules explains when the sign 民 exists (or does not exist) in the sections of the YZCQ text. The first rule lists 23 conditions, and each condition is composed of two sub-conditions. The rule says that if condition #1 holds, or condition #2 holds, …, or condition #23 holds, then there is a high probability that 民 exists in the section, and if none of the conditions holds, then there is a high probability that 民 does not exist in the section.
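The way such a rule is read can be made concrete with a short sketch. The representation of a condition as a list of `(sign, must_exist)` pairs and the function names are our own assumptions, not WizWhy's internal format; the three conditions are the first three of Rule 1, and the toy sections are invented.

```python
def condition_holds(condition, section):
    """A condition holds when every sub-condition matches the section."""
    return all((sign in section) == must_exist for sign, must_exist in condition)


def rule_fires(conditions, section):
    """The rule fires when at least one of its conditions holds."""
    return any(condition_holds(c, section) for c in conditions)


def success_rate(conditions, sections, target):
    """Share of sections where 'rule fires' agrees with 'target exists'."""
    explained = sum(
        1 for s in sections if rule_fires(conditions, s) == (target in s)
    )
    return explained / len(sections)


# The first three conditions of Rule 1 for 民, as (sign, must_exist) pairs.
conditions = [
    [("我", False), ("歛", True)],
    [("樂", True), ("內", True)],
    [("厚", False), ("財", True)],
]

# Hypothetical toy sections, not taken from the YZCQ text.
sections = ["民歛財", "樂內民", "我歛", "公問"]

print([rule_fires(conditions, s) for s in sections])  # [True, True, False, False]
print(success_rate(conditions, sections, "民"))        # 1.0
```

The success rate counts both directions of the rule: sections where a condition holds and 民 exists, and sections where no condition holds and 民 does not exist.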

Note that to say that all the conditions do not hold is to say that (referring to the second rule):

1) 何 *exists*

or 賜 *does not exist*, and

2) 景 *exists*

or 善 *does not exist*, and

3) 矣 *exists*

or 左 *does not exist*, and

etc.

This formulation follows De Morgan's laws in logic, according to which

Not (A *and* B) is equal to (Not-A *or* Not-B)

Not (A *or* B) is equal to (Not-A *and* Not-B)
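Both laws can be checked exhaustively, since two truth values give only four combinations. The sketch below does so and then applies the first law to condition 1 of the second rule (何 does not exist and 賜 exists); the function and parameter names are ours.

```python
from itertools import product

# Check both De Morgan laws over every combination of truth values.
for a, b in product([False, True], repeat=2):
    assert (not (a and b)) == ((not a) or (not b))
    assert (not (a or b)) == ((not a) and (not b))


def condition_holds(he_exists, ci_exists):
    """Condition 1 of Rule 2: 何 does not exist and 賜 exists."""
    return (not he_exists) and ci_exists


def condition_fails(he_exists, ci_exists):
    """Its negation by De Morgan: 何 exists or 賜 does not exist."""
    return he_exists or (not ci_exists)


# The negated form agrees with `not condition_holds` in all four cases.
agreement = all(
    condition_fails(a, b) == (not condition_holds(a, b))
    for a, b in product([False, True], repeat=2)
)
print(agreement)  # True
```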

As mentioned the rule presents necessary and sufficient conditions. These conditions refer to *all* the records (contrary to if-then rules that usually refer to some records only).

When revealing the rules, the target is to maximize the number of records that are explained by the conditions (both the sections where the dependent variable, for example 民, exists and the sections where it does not) and to minimize the number of conditions. In other words, the program looks for a model that is as simple as possible and as accurate as possible. There is usually a trade-off between these two targets.

At the end of each rule the program displays the improvement factor: this number denotes how much the predictions based on the rule are better than predictions that are based on the frequencies of the values, taking into account the cost of a miss and the cost of a false alarm. However this issue is beyond the scope of this paper.
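Although the details are not discussed here, the arithmetic printed next to each improvement factor, for example min((108*1),(109*1)) / (8*1+3*1), can be reproduced directly. The sketch below is our reading of that printed expression and not WizWhy's documented definition; in particular, how the two costs pair up with the two kinds of error is an assumption, which makes no difference when both costs equal 1 as in the reports.

```python
def improvement_factor(class_counts, rule_errors, costs=(1, 1)):
    """Errors of a frequency-only guess divided by the errors of the rule.

    class_counts: how often each value of the dependent variable occurs
    rule_errors:  how many records of each value the rule gets wrong
    costs:        cost attached to an error on each value (1 in the reports)
    """
    baseline = min(n * c for n, c in zip(class_counts, costs))
    rule_cost = sum(e * c for e, c in zip(rule_errors, costs))
    return baseline / rule_cost


# Figures copied from the two reports for 民.
print(round(improvement_factor((108, 109), (8, 3)), 3))    # 9.818
print(round(improvement_factor((108, 109), (26, 19)), 3))  # 2.4
```

A rule that made no errors at all would give a zero denominator, a case this sketch does not handle.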

The second dependent variable was: 社稷

The program discovered only one rule. This rule includes just 5 conditions, so it is much simpler than the previous rules.

**If-and-only-if Rule 1 (out of 1)**

The following conditions explain when

社稷 *does not exist*

1) 令 *does not exist*

and 遂 *does not exist*

2) 所 *exists*

and 危 *does not exist*

3) 國 *does not exist*

and 朝 *does not exist*

4) 大 *does not exist*

and 遂 *exists*

5) 下 *does not exist*

and 令 *exists*

When at least one of the conditions holds, the probability that

社稷 *does not exist*

is **0.995** (204 out of 205 cases)

When all the conditions do not hold, the probability that

社稷 *exists*

is **1.00** (12 out of 12 cases)

The total number of cases explained by the set of conditions: **216**

The total number of cases in the data: **217**

Success rate: **0.995** (216 / 217)

The primary probability that:

社稷 *does not exist* is **0.940** (204 out of 217 cases)

社稷 *exists* is **0.060** (13 out of 217 cases)

Improvement Factor: **13.000** (min((13*1),(204*1)) / (1*1+0*1))

Finally we analyzed the dependent variable: 仁

Unfortunately the rules that explain the existence of this sign are much weaker than the previous rules:

The program discovered just one rule that refers to the single signs in each section.

**If-and-only-if Rule 1 (out of 1)**

The following conditions explain when

仁 *does not exist*

1) 臣 *does not exist*

and 聞 *does not exist*

2) 治 *exists*

and 歸 *does not exist*

3) 國 *does not exist*

and 行 *exists*

4) 君 *does not exist*

and 及 *does not exist*

5) 矣 *does not exist*

and 若 *exists*

6) 可 *does not exist*

and 欲 *exists*

7) 言 *does not exist*

and 朝 *exists*

8) 成 *exists*

and 入 *does not exist*

9) 三 *does not exist*

and 受 *exists*

10) 焉 *does not exist*

and 哉 *exists*

11) 死 *does not exist*

and 二 *exists*

12) 景 *exists*

and 邪 *exists*

13) 何 *does not exist*

and 日 *exists*

14) 君 *exists*

and 正 *exists*

15) 乎 *does not exist*

and 臣 *does not exist*

16) 矣 *does not exist*

and 賢 *exists*

When at least one of the conditions holds, the probability that

仁 *does not exist*

is **1.00** (190 out of 190 cases)

When all the conditions do not hold, the probability that

仁 *exists*

is **0.963** (26 out of 27 cases)

The total number of cases explained by the set of conditions: **216**

The total number of cases in the data: **217**

Success rate: **0.995** (216 / 217)

The primary probability that:

仁 *does not exist* is **0.880** (191 out of 217 cases)

仁 *exists* is **0.120** (26 out of 217 cases)

Improvement Factor: **26.000** (min((26*1),(191*1)) / (0*1+1*1))

The second analysis refers to the pairs of signs. Once again, only one rule was discovered.

**If-and-only-if Rule 1 (out of 1)**

The following conditions explain when

仁 *does not exist*

1) 子曰 *does not exist*

and 天下 *does not exist*

2) 也公 *does not exist*

and 曰君 *exists*

3) 不足 *exists*

and ！晏 *does not exist*

4) 何晏 *exists*

5) 君之 *does not exist*

and 之以 *exists*

6) 先君 *exists*

and ！晏 *does not exist*

7) 曰嬰 *does not exist*

and 以不 *exists*

8) 之所 *does not exist*

and 曰夫 *exists*

9) 君子 *does not exist*

and 子晏 *exists*

10) 夫子 *does not exist*

and 之〕 *exists*

11) 天下 *does not exist*

and 之行 *exists*

12) 之晏 *does not exist*

and 曰善 *exists*

13) 公不 *exists*

and 之言 *does not exist*

14) 公問 *exists*

and 也公 *does not exist*

15) 不可 *does not exist*

and 曰臣 *exists*

16) 古之 *does not exist*

and 〕不 *exists*

17) 不可 *does not exist*

and 君子 *exists*

18) 景公 *does not exist*

and 曰嬰 *does not exist*

When at least one of the conditions holds, the probability that

仁 *does not exist*

is **0.979** (187 out of 191 cases)

When all the conditions do not hold, the probability that

仁 *exists*

is **0.846** (22 out of 26 cases)

The total number of cases explained by the set of conditions: **209**

The total number of cases in the data: **217**

Success rate: **0.963** (209 / 217)

The primary probability that:

仁 *does not exist* is **0.880** (191 out of 217 cases)

仁 *exists* is **0.120** (26 out of 217 cases)

Improvement Factor: **3.250** (min((26*1),(191*1)) / (4*1+4*1))

**Conclusion**

Machine learning techniques can be used to discover patterns in text. We demonstrated the application of a machine learning technique to the YZCQ text, but any text can be analyzed by this method.

Are the above-mentioned rules interesting? When the number of conditions is small and the accuracy (probability) is high, the rule is unexpected, and being unexpected is a necessary condition for being interesting. It is a necessary condition but not a sufficient one. The scholars of Chinese texts have to say whether or not these rules may contribute to their research.

If the answer is positive, the following two recommendations are relevant:

When analyzing other text files, it is recommended to look for files having many sections (the more the better) in order to avoid revealing accidental patterns.

It is also recommended to use a machine learning algorithm that produces an easy-to-understand model. If you want to use the software program that was used in this research, download the WizWhy demo from www.wizsoft.com. The demo version is identical to the full version except that it is limited to 1,000 records. If your text data include more than 1,000 sections, you may send me the Excel file (convert the Chinese signs into Unicode) and I’ll send you the analysis: abraham@wizsoft.com