A Data Mining tool for issuing predictions, summarising data, and revealing interesting phenomena
Abraham Meidan, Ph.D.
Many data sets contain valuable information that is not readily obvious. Examples of such information might be:
- Patterns of high-risk companies within financial data;
- Types of customers in a direct mailing list who are most likely to make a purchase;
- The relationships between patients’ personal data and their medical diagnosis.
The search for these valuable, yet hidden, patterns and relationships within the data is known as data mining.
Users may be interested in data mining applications for several reasons:
- Some users are interested in data mining as a way to summarise the data: when the records are too numerous to be reviewed one by one, a summary is needed, and revealing the main patterns in the data provides a useful one.
- Other users expect data mining to reveal interesting phenomena in the data. These users wish to ignore trivial cases and concentrate rather on unexpected phenomena.
- Still other users are interested in issuing predictions for new cases. They wish to reveal the patterns in the data in order to use them for this purpose. For example, revealing the patterns of risky customers in financial data enables one to predict whether a new customer is risky or not.
How does WizWhy work?
Prior to using WizWhy, one should have a data set that he or she wishes to analyse. WizWhy will determine how the values of one field are affected by the values of other fields.
For example, suppose that you maintain a customer data bank where each record contains a range of fields relating to a customer, such as: Customer Name, Address, City, State, Field of Business, Sale Person, Amount Purchased, % Growth since Last Year. One of these fields, say, % Growth since Last Year, should be defined as the dependent variable, while the other fields are the independent variables.
In this example, the aim might be to analyse customer retention, i.e., to reveal the patterns of those customers whose % Growth since Last Year is small. The % Growth since Last Year can then be analysed as either a Boolean or a continuous variable. In a Boolean analysis, the aim might be to reveal the patterns of customers whose growth since last year is below, say, 1% (as opposed to those at or above it). A continuous analysis is more detailed: it calculates the specific % growth of each customer as a function of the other fields.
On analysing the data, WizWhy performs the following operations:
- It first reads the data. The user selects the dependent variable (% Growth since last year) and can fine-tune the analysis by defining parameters such as the minimum probability of the rules, the minimum number of cases in each rule, and the cost of a miss vs. the cost of a false alarm. WizWhy follows these “instructions” when issuing the rules.
- Within a short time, WizWhy lists the rules that relate the dependent variable to the other fields. The rules are formulated as “if-then” and “if-and-only-if” sentences. On the basis of the discovered rules, WizWhy also points out the main patterns, the unexpected phenomena and the unexpected cases in the data.
- WizWhy can now make predictions for new cases; for instance, given the data of a new customer, WizWhy can calculate the expected % Growth. These predictions can be either Boolean (for instance, whether or not the % Growth is above 1%) or continuous (for example, the % Growth is between 5% and 7%).
What is an if-then rule?
WizWhy starts analysing the data by revealing all the if-then rules that relate the Dependent Variable to the other fields. An example of an if-then rule is:
If City is New-York
and Amount Purchased is 200 … 300 (average = 250)
and Sale Person is Dave
Then Growth since Last Year is less than 1%
Rule’s probability: 0.70
The rule exists in 370 records
Significance Level: Error probability < 0.001
This rule says that, among the customers who reside in New York, purchase between 200 and 300, and are served by the sale person Dave, 70% show a growth since last year of less than 1%.
The term “probability” designates what other data mining tools call “Confidence Level”. For the rule to be informative, this probability should be significantly higher than the overall frequency of the value under analysis (i.e., the overall frequency of customers whose Growth since Last Year is less than 1% should be much lower than the rule probability of 70%).
“Error probability” indicates the degree to which the rule can be relied upon as a basis for predictions. Assuming that the data under analysis is a representative sample of an infinite population, the error probability quantifies the chances that the rule does not hold in the entire population and exists accidentally in the file under analysis.
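The quantities reported with a rule can be illustrated with a short Python sketch. It is not WizWhy's actual computation, which the text does not spell out: the function names, the record layout and the use of a one-sided binomial tail as a stand-in for the “error probability” are all assumptions made for illustration.

```python
from math import comb


def rule_statistics(records, conditions, target):
    """Return (cases, probability, base_rate) for an if-then rule.

    records    -- list of dicts, one dict per record
    conditions -- dict mapping a field name to a predicate on that field's value
    target     -- predicate on a whole record (the "then" part of the rule)
    """
    # Records that satisfy every "if" condition of the rule.
    matching = [r for r in records
                if all(test(r[field]) for field, test in conditions.items())]
    hits = sum(1 for r in matching if target(r))
    probability = hits / len(matching) if matching else 0.0
    # Overall frequency of the value under analysis in the whole file.
    base_rate = sum(1 for r in records if target(r)) / len(records)
    return len(matching), probability, base_rate


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p).

    Assumed stand-in for the error probability: the chance of seeing at
    least k positive cases among n matching records if the rule had no
    effect and each record were positive with the base rate p.
    """
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
```

For the rule above, which holds in 370 records with probability 0.70 (259 positive cases), `binomial_tail(370, 259, base_rate)` would give such an estimate once the overall frequency `base_rate` is known.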
Numeric fields, such as Amount Purchased and % Growth, are automatically segmented into intervals, and these intervals are the values in the if-then rules. For example, in the if-then rule above, the second condition refers to the case where the value in the Amount Purchased field is the interval between 200 and 300. WizWhy employs a unique algorithm for the optimal segmentation of numeric (continuous) fields.
Revealing all the if-then rules is known as the “association rules” method. One of the main challenges of such a method is to validate each possible relationship within a reasonable time span. For instance, data might contain:
- 10,000 records
- 20 fields in each record
- An average of 10 possible values for each field
Checking every possible relation in such data by conventional means would require thousands of years. WizWhy employs a sophisticated algorithm that reveals all the rules in an astonishingly short time. In the previous example, it would take WizWhy just a few minutes to discover all the if-then rules under investigation.
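The size of this search space can be made concrete with a small calculation. The sketch below is only a counting exercise under stated assumptions: one of the 20 fields is taken as the dependent variable, each remaining field has exactly 10 values, and a candidate “if” part fixes a subset of fields to one value each. The function name is hypothetical, and nothing here reflects how WizWhy itself searches.

```python
from math import comb


def count_candidate_conditions(n_fields, n_values, max_conditions=None):
    """Count the distinct "if" parts that use between 1 and max_conditions
    fields, each chosen field being fixed to one of n_values values."""
    if max_conditions is None:
        max_conditions = n_fields
    return sum(comb(n_fields, k) * n_values**k
               for k in range(1, max_conditions + 1))


# 20 fields, one of which is the dependent variable: 19 independent fields.
up_to_three = count_candidate_conditions(19, 10, max_conditions=3)
unrestricted = count_candidate_conditions(19, 10)
print(up_to_three, unrestricted)
```

Under these assumptions the unrestricted count equals 11^19 - 1, on the order of 10^19 candidates, each of which would have to be checked against the 10,000 records. This is why exhaustive enumeration is impractical and why limiting the number of conditions per rule matters so much.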
What is an if-and-only-if rule?
On the basis of the if-then rules WizWhy proceeds to search for if-and-only-if rules. An example of an if-and-only-if rule is:
The following conditions explain when the Growth since Last Year is less than 1%:
If at least one of these conditions holds, the probability that the Growth since Last Year is less than 1% is 0.9
If none of these conditions holds, the probability that the Growth since Last Year is not less than 1% is 0.95
The conditions are:
(1) The Amount Purchased is between 0 … 199 (average = 100)
(2) The Sale Person is Dan and the City is Boston
In other words, the Growth since Last Year is less than 1% if and only if the Amount Purchased is between 0 … 199, or the Sale Person is Dan and the City is Boston.
If-then rules represent sufficient conditions (the “if” condition is a sufficient condition for the result). If-and-only-if rules go one step further: they represent necessary and sufficient conditions. In the previous example, the two conditions, (1) and (2), are necessary and sufficient conditions for the growth being less than 1%.
Obviously such a relation cannot be accidental, and therefore might be relied upon when issuing predictions. Indeed, when WizWhy reveals if-and-only-if rules it takes them into account when issuing predictions for new cases.
Revealing the if-and-only-if rules also helps in pointing out the main patterns in data. One common “complaint” against if-then rules is that they are too numerous. Indeed, in many data sets WizWhy may discover thousands of if-then rules. All of them are valid and may be used for issuing predictions, but in practice they are too many to be read in order to understand the data. Revealing if-and-only-if rules solves this problem. Since WizWhy searches for the optimal if-and-only-if rules, each value of the dependent variable is explained by one or two if-and-only-if rules. These rules are optimal in the sense that they cover the maximum number of both positive and negative examples.
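The two probabilities reported with an if-and-only-if rule can be computed as in the following sketch. It only evaluates a given set of conditions; it does not search for the optimal ones. The function name and the record layout (including the `Growth` field) are hypothetical.

```python
def evaluate_iff_rule(records, conditions, target):
    """Return (p_if, p_only_if) for a set of alternative conditions.

    p_if      -- P(target | at least one condition holds)
    p_only_if -- P(not target | none of the conditions holds)
    Either value is None when the corresponding group of records is empty.
    """
    covered = [r for r in records if any(cond(r) for cond in conditions)]
    uncovered = [r for r in records if not any(cond(r) for cond in conditions)]
    p_if = (sum(1 for r in covered if target(r)) / len(covered)
            if covered else None)
    p_only_if = (sum(1 for r in uncovered if not target(r)) / len(uncovered)
                 if uncovered else None)
    return p_if, p_only_if


# The two conditions of the example rule, written as predicates on a record.
conditions = [
    lambda r: 0 <= r["Amount Purchased"] <= 199,
    lambda r: r["Sale Person"] == "Dan" and r["City"] == "Boston",
]
target = lambda r: r["Growth"] < 1  # Growth since Last Year is less than 1%
```

In the example above, the first value would be 0.9 and the second 0.95; the closer both are to 1, the closer the conditions come to being necessary and sufficient.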
How does WizWhy summarize the data?
As already mentioned, some users are interested in data mining in order to issue a data summarisation. They are interested in a report that presents the main patterns in the data.
WizWhy meets this target by listing the relations between all the values in each field and the dependent variable. Consider the Amount Purchased field in the above-mentioned example. WizWhy segments the field into intervals (as mentioned, WizWhy employs a unique algorithm that segments numeric fields in an optimal way), and displays the relation between each interval and the value under analysis (Growth since Last Year is less than 1%).
For example, when analysing the Amount Purchased field, WizWhy may reveal the following relations:
|IF the Amount Purchased is between:|THEN the probability that the Growth since Last Year is less than 1% is:|
|0 – 199|50%|
|200 – 300|40%|
|301 – 480|30%|
|481 – 791|15%|
Each line is an if-then statement. Some or even all of these if-then statements may not be rules, since they may not meet the requirement that the rule probability be significantly higher or lower than the primary frequency. Still, all of them represent basic trends in the data.
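A table of this kind can be produced with a few lines of Python. The sketch assumes the intervals have already been chosen (WizWhy's own segmentation algorithm is not described here), and the function and field names are hypothetical.

```python
def interval_probabilities(records, field, intervals, target):
    """For each (low, high) interval of a numeric field, return a row
    (low, high, cases, probability), where probability is the share of
    records in the interval that satisfy the target predicate.
    The probability is None for an interval that contains no records."""
    rows = []
    for low, high in intervals:
        in_interval = [r for r in records if low <= r[field] <= high]
        hits = sum(1 for r in in_interval if target(r))
        probability = hits / len(in_interval) if in_interval else None
        rows.append((low, high, len(in_interval), probability))
    return rows


# Intervals taken from the table above.
intervals = [(0, 199), (200, 300), (301, 480), (481, 791)]
```

The number of cases is returned alongside each probability because an interval with very few records gives an unreliable estimate.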
WizWhy applies such an analysis to each of the fields. When the field is categorical (such as City or Sale Person), WizWhy displays the largest values (all the other values are grouped into one additional value).
WizWhy also calculates the explanatory power of each field. The explanatory power designates how well the field explains the dependent variable. Sorting the fields by this parameter lists them according to their “importance” in explaining the value under analysis (in this example, which customers show a Growth since Last Year of less than 1%).
WizWhy also illustrates these one-condition relations graphically: each value is represented as a bar, where the height denotes the probability of the value under analysis (here, growth below 1%) and the width signifies the number of cases (i.e., customers) having this value.
This analysis of the basic rules and trends results in a data summarisation. These basic rules and trends summarise the data, in the sense that they explain all the other rules, with the exception of the unexpected rules. This idea will be further discussed in the next section.
How does WizWhy reveal interesting phenomena?
Revealing interesting phenomena relies on the assumption that unexpected phenomena are interesting. For example, an event that is inconsistent with (namely unexpected by) an accepted theory is an interesting event. Now, each rule can be viewed as an event, and the one-condition rules and trends, discussed in the previous section, can be viewed as the “basic theory” that describes the data. Therefore, by calculating how unlikely each rule is relative to the basic trends, the unexpected rules can be revealed. These unexpected rules signify the interesting phenomena in the data.
The Unexpected Rule has at least two conditions. The Basic Rules have fewer conditions (in many cases they have one condition only), and the Basic Trends, by definition, have one condition only, as well. Each of the conditions of the Unexpected Rule appears in the Basic Rules and Trends. The Unexpected rule is unlikely relative to the Basic Rules and Trends.
The level of unlikelihood is computed in the following manner. Consider a data set of 1000 records, where each record refers to one patient and states whether the patient shows symptom A, whether the patient shows symptom B, and the diagnosis (whether or not the patient suffers from the disease D). Suppose also that 30% of the patients have the disease D, and that the following three rules were discovered:
- If a patient shows symptom A, the probability that he or she suffers from the disease D is 60%.
- If the patient shows symptom B, the probability that he or she suffers from the disease D is 60%.
- If the patient shows both symptom A and symptom B, the probability of the disease D is 20%.
In this example, rules no. 1 and 2 are the Basic Rules, and rule no. 3 is the Unexpected Rule.
On the basis of rules 1 and 2, WizWhy calculates what the probability of suffering from the disease D should have been among the patients showing both symptoms, A and B, as opposed to the actual probability in rule 3. This is the expected probability. To calculate the expected probability, WizWhy measures the dependency between symptoms A and B. The difference between the expected probability and the actual probability signifies how unlikely rule 3 is with regard to rules 1 and 2.
WizWhy measures this unlikelihood in an additional way. WizWhy calculates the conditional probability of the event described in rule 3, under the conditions described in rules 1 and 2. In the example under discussion, the conditional probability is almost 0, and therefore, the level of unlikelihood, which is 1 minus the conditional probability, is almost 1. Note that the probability of the Unexpected Rule may be much higher or much lower than the expected probability. Any significant deviation is unexpected.
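To give a feel for the size of such a deviation, the sketch below computes an expected probability for the example under one simplifying assumption: symptoms A and B are conditionally independent given the diagnosis. This is an assumption made for illustration only, since WizWhy measures the actual dependency between A and B, and the function name is hypothetical.

```python
def expected_joint_probability(prior, p_given_a, p_given_b):
    """Expected P(D | A and B) from P(D), P(D | A) and P(D | B), assuming
    A and B are conditionally independent given D and given not-D.
    All inputs must lie strictly between 0 and 1."""
    def odds(p):
        return p / (1 - p)

    prior_odds = odds(prior)
    # Each basic rule multiplies the prior odds by its own likelihood ratio.
    ratio_a = odds(p_given_a) / prior_odds
    ratio_b = odds(p_given_b) / prior_odds
    combined_odds = prior_odds * ratio_a * ratio_b
    return combined_odds / (1 + combined_odds)


expected = expected_joint_probability(0.30, 0.60, 0.60)  # about 0.84
actual = 0.20                                            # rule 3
deviation = expected - actual                            # about 0.64
```

Under this assumption, two symptoms that each raise the probability of D from 30% to 60% would be expected to raise it to roughly 84% together, so an observed 20% is a large deviation.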
How does WizWhy issue predictions?
WizWhy makes use of the rules discovered in the data set in order to issue predictions for new cases. When a new record is entered, WizWhy applies the rules to the values of this record and calculates the expected value of the dependent variable. For example, one can run WizWhy on financial data, where the dependent variable is a field signifying whether the company has gone bankrupt. WizWhy will reveal the rules that relate the company data to the probability of going bankrupt. When the data of a new company is entered, WizWhy will then apply the relevant rules and calculate the probability of this company going bankrupt. When issuing the predictions, WizWhy can list the rules that entail each prediction. These rules serve as the explanations for the predictions.
The predictions can be either Boolean (for instance, whether the company will go bankrupt or not) or multi-value (for example, given the patient’s symptoms, what is the disease) or continuous (for instance, given the financial data, what is the expected rate of growth).
Still, when issuing the predictions on the basis of rules one faces the following two problems:
(i) How can rules representing noise or overfitting be avoided?
For any rule discovered in the data, one may ask what explains its existence. The two extreme answers are (a) the rule exists in the entire population, of which the data under analysis is a representative sample, or (b) the rule is the result of chance, i.e., it is a case of noise or overfitting. Obviously, when issuing predictions for new cases, rules existing in the entire population should be taken into account while rules representing noise should be ignored.
WizWhy measures the probability that a rule is the result of chance by considering two parameters:
- The error probability of the rule: The lower the error probability, the lower the probability that the rule is the result of chance, and therefore the higher the probability that the rule exists in the entire population.
- The level of unlikelihood of the rule (in case the rule is unexpected): The higher the level of unlikelihood, the lower the probability that the rule is an accidental phenomenon, and therefore the more the rule can be relied on when issuing predictions.
Note, however, that this problem does not relate to the if-and-only-if rules. Obviously these rules cannot be accidental and therefore predictions that are based on them do not suffer from the problem of overfitting.
Avoiding overfitting is one of the main features that differentiate WizWhy from other prediction applications based on neural nets, decision trees or genetic algorithms. As a result, the accuracy level of WizWhy predictions is usually much higher in comparison with other approaches.
(ii) How can cases of rule inconsistency be solved?
When using if-then rules to issue predictions, one may face cases where some of the rules predict one way while others predict the other (for example, some of the rules predict that the company tends to go bankrupt, while other rules predict that it does not). WizWhy solves this problem by using the error probability and the level of unlikelihood. Since these two parameters signify the extent to which the rule can be relied on, WizWhy weighs the rules according to these parameters and calculates the prediction in cases of inconsistency.
Once again, note that the problem does not apply to the case where the predictions are based on the if-and-only-if rules. The if-and-only-if rules are always consistent.
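The idea of weighing conflicting if-then rules by their reliability can be sketched as follows. The text does not give WizWhy's weighting formula, so the weight used here, the negative base-10 logarithm of the error probability, is purely an assumption for illustration, as is the function name; the level of unlikelihood is left out for simplicity.

```python
from math import log10


def combine_rule_predictions(fired_rules):
    """Combine the rules that apply to one new record.

    fired_rules -- list of (rule_probability, error_probability) pairs,
                   where rule_probability is the probability the rule
                   assigns to the value under analysis and
                   error_probability lies strictly between 0 and 1.
    Returns a weighted average of the rule probabilities in which rules
    with a lower error probability count for more (assumed weighting).
    """
    weights = [-log10(error) for _, error in fired_rules]
    total = sum(weights)
    return sum(p * w for (p, _), w in zip(fired_rules, weights)) / total


# One reliable rule predicts bankruptcy (0.9); a weaker rule disagrees (0.2).
prediction = combine_rule_predictions([(0.9, 0.001), (0.2, 0.1)])
```

Here the first rule carries three times the weight of the second, so the combined probability stays closer to 0.9 than to 0.2.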
How does WizWhy point out unexpected cases?
On top of issuing predictions for new cases, the rules can be used to point out unexpected cases in the data. WizWhy issues predictions regarding the records of the data set under analysis and points out cases where the dependent variable’s value deviates from the value anticipated according to the rules. Such a deviation may be the result of noise, but it can also indicate a data entry error, a fraud or another type of error. WizWhy lists those cases, displaying the expected value together with the relevant rules. The user can review the cases in order to audit the data.
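A minimal sketch of this auditing step is shown below. It assumes a Boolean dependent variable and a `predict` function that returns the probability the rules assign to the positive value; both names and the confidence threshold are hypothetical.

```python
def find_unexpected_cases(records, predict, actual, threshold=0.8):
    """Return (index, predicted_probability, actual_value) for every record
    whose actual value contradicts a confident prediction.

    predict   -- function: record -> probability that the target is true
    actual    -- function: record -> True/False, the recorded value
    threshold -- how confident a prediction must be before a mismatch
                 is reported
    """
    flagged = []
    for index, record in enumerate(records):
        probability = predict(record)
        is_true = actual(record)
        confidently_true = probability >= threshold
        confidently_false = probability <= 1 - threshold
        if (confidently_true and not is_true) or (confidently_false and is_true):
            flagged.append((index, probability, is_true))
    return flagged
```

Records for which the prediction is not confident either way are never flagged, which keeps the audit list short.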
What kind of data does WizWhy read?
WizWhy can read data either directly or through ODBC and OLE DB:
- WizWhy directly reads ASCII, *.dbf (dBase, FoxPro, etc.), MS Access, MS SQL and Oracle data sets.
- WizWhy also reads any ODBC or OLE DB compliant database.
- WizWhy can join several tables into one data set.
Who can benefit from WizWhy?
Data mining in general, and WizWhy in particular, have many practical applications. You can use WizWhy in most cases where simple or complex data analysis and predictions are required. For instance:
- WizWhy can assist professionals in the fields of medicine and social sciences to enhance their diagnostic and research efforts.
- Banks and financial institutions can use WizWhy to indicate risky customers.
- Corporations implementing direct marketing will find that WizWhy is the ideal tool for increasing the success rate of direct mailing.
Scientific research: WizWhy can be applied for inferring rules in a wide range of scientific fields, including medicine, economics, psychology, sociology and geology.
- In medical research, WizWhy can reveal rules relating symptoms to diseases.
- In geological research, WizWhy can reveal rules that show the relationships between soil data and mineral location.
- Researchers in social sciences can use WizWhy to discern patterns and characteristics that designate students’ aptitude for academic success.
Generally, in research entailing large quantities of data, WizWhy can save significant time and effort by assuming the scientist’s burden of revealing rules. WizWhy may just revolutionise traditional methods of conducting research in scientific fields.
Banks, credit and insurance companies: These organizations can use WizWhy to discern financially unstable customers and to predict the degree of financial risk of a new customer. To do so, WizWhy reads the customer data file and reveals the patterns of the financially risky customers. Rules describing these patterns are saved to file. Later, company personnel can analyze a new customer by entering the customer’s data and having WizWhy calculate the extent of that customer’s financial risk.
Market research: Market researchers can use WizWhy to improve the success rate of direct marketing campaigns. Prior to mailing a large quantity of marketing literature, a small batch can be sent to a representative sample of the population. WizWhy can then discover the patterns of the customers who are more likely, and of those who are less likely, to purchase relative to the average purchasing rate. For example, if the purchasing rate is 2%, WizWhy can be instructed to discover those customers whose purchasing rate is 4% and above, or 1% and below. The result can then be applied to the master file in order to define the optimal direct mailing list.
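The thresholds in this example amount to selecting customer segments whose purchasing rate is at least double, or at most half, the average. A sketch of that selection step, with hypothetical names and assuming the per-segment purchasing rates have already been measured on the test mailing:

```python
def split_segments(segment_rates, baseline, high_factor=2.0, low_factor=0.5):
    """Split segments into those well above and well below the baseline.

    segment_rates -- dict mapping a segment description to its purchasing rate
    baseline      -- the average purchasing rate (e.g. 0.02 for 2%)
    Returns (likely, unlikely): segments with a rate of at least
    baseline * high_factor, and segments with a rate of at most
    baseline * low_factor.
    """
    likely = {name: rate for name, rate in segment_rates.items()
              if rate >= baseline * high_factor}
    unlikely = {name: rate for name, rate in segment_rates.items()
                if rate <= baseline * low_factor}
    return likely, unlikely
```

Segments in the first group would be kept in the mailing list, and those in the second group dropped.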