I've been working on my thesis for a little too long. I'm hopefully about finished, but that is beside the point. Part of what I've been working on revolves around the Apriori Data Mining algorithm. If you know what Apriori is, and you are looking for how to implement it, then this post is for you.

I've created a JavaScript implementation of Apriori. Admittedly, JavaScript probably isn't the most efficient programming language to implement Apriori in; however, I was constrained to use it for my project [1]. There have been many improvements to the Apriori Algorithm since Agrawal suggested it in 1993, but I chose to create a simple implementation of the original algorithm instead. First, learn how the algorithm works; second, learn how to optimize it.


A little background

The purpose of the Apriori Algorithm is to find associations between different sets of data. It is sometimes referred to as "Market Basket Analysis". Each set of data has a number of items and is called a transaction. The output of Apriori is a set of rules that tell us how often certain items appear together in the transactions. Here is an example:

# each line is a set of items
a b c
a b d
a b e
a b d
# 100% of sets with a also contain b
# 25% of sets with a,b also have c
# 50% of sets with a,b also have d

The associations that Apriori finds are called Association rules. An association rule has two parts. The Antecedent is a subset of items found in sets of data. The Consequent is an item that is found in combination with the antecedent. Two terms describe the significance of the association rule. The Support is the percentage of all data sets that contain both the antecedent and the consequent. The Confidence is the percentage of data sets containing the antecedent that also contain the consequent.

From the example above, we could find the following association rules

# consequent <- antecedent (confidence, support)
a <- b (100%, 100%) # b <- a is the same
c <- a b (25%, 25%)
d <- a b (50%, 50%)
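
For example, take the rule d <- a b: the item set {a,b,d} appears in 2 of the 4 transactions, so its support is 2/4 = 50%, and since the antecedent {a,b} appears in all 4 transactions, the confidence is also 2/4 = 50%.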

It's easy to see the outcome from a simple example, but in some cases there are thousands (or more) of items to search and possibly a much larger number of combinations of those items making up all the data sets. Apriori is a fairly efficient way to iterate through the sets of data and find the association rules that meet a minimum confidence and support provided as input.

Some more definitions

  • Large Item Set

    A set of items that is contained in enough transactions such that the percentage of transactions that contain this item set is greater than or equal to the minimum support. e.g., in the sets above, [a,d] would be a large item set if the minimum support was less than or equal to 50%.

  • Candidate Set

    A set of items that is currently being tested to see whether or not it is a large item set. Candidate sets have a reference count and a current count. The reference count is the number of transactions (or sets of items) in the entire data set that contain this candidate set. The current count is the number of transactions processed so far in the current pass that contain the set; it is used to estimate how likely further extensions of the set are to be large item sets.

  • Frontier Set

    The set of candidate sets that are tested against each transaction during a pass. It is rebuilt at the end of each pass from the candidate sets that turned out to be large but were not expected to be, since those still need to be extended further.

I have two implementations of this algorithm for my thesis. The older version is an iterative, almost direct implementation of the original Apriori algorithm. The newer version uses JavaScript 1.7 generators to provide a chunked implementation that runs more smoothly in Firefox. You are free to download either version for reference or to modify and use as you wish.

Now, the algorithm explained. I've made minor modifications to the actual code here for clarity in explanation.

  1. Initialize your data sets

    You can view my completed code for the definitions of LargeItemSets and CandidateSet.
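
    If you only want to follow along with this post, here is a rough sketch of the interfaces those two objects need to expose for the code below. It is an assumption based on how the objects are used here, not the actual thesis code:

    // minimal sketch of CandidateSet, inferred from its usage below
    function CandidateSet() {
     this.set = [];            // ordered list of items in this candidate set
     this.ref_count = 0;       // transactions found to contain this set
     this.cur_count = 0;       // transactions counted so far in the current pass
     this.expected_large = true;
    }
    // true if every item in this set appears in the item list `items`
    CandidateSet.prototype.contained_in = function(items) {
     return this.set.every(function(i) { return items.indexOf(i) >= 0; });
    };
    // true if `other` holds exactly the same items, in the same order, as this set
    CandidateSet.prototype.equals_set = function(other) {
     return this.set.length == other.length &&
      this.set.every(function(i, n) { return i == other[n]; });
    };

    // minimal sketch of LargeItemSets (yieldRules is added near the end of the post)
    function LargeItemSets() {
     this.sets = {};       // discovered large item sets, grouped by length
     this.max_length = 0;
    }
    LargeItemSets.prototype.add = function(cs) {
     var len = cs.set.length;
     if (!this.sets[len]) this.sets[len] = [];
     this.sets[len].push(cs);
     if (len > this.max_length) this.max_length = len;
    };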

    var L=new LargeItemSets(); // list to store discovered large item sets
    // at first, the frontier set contains one candidate set that contains the empty set
    // the ref_count for this empty set is the length of the transactions (every transaction
    // contains the empty set).
    var F=[new CandidateSet()]; // list of frontier sets
    F[0].ref_count=trans.length; // trans=list of all transactions
    var passCount=0; // the number of times the algorithm has iterated.
    var support=.1; // whatever percentage you want to define for support
    var confidence=.5; // whatever percentage you want to define for confidence
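
    For completeness, the code in this post also assumes two input structures that aren't shown: trans, the full list of transactions, and all_items, an index from each item to the transactions that contain it (used later to estimate item frequencies). Here is one plausible way to build them from the example data at the top of the post; the exact representation in my code may differ.

    // trans: each transaction is an ordered array of items
    var trans = [
     ["a","b","c"],
     ["a","b","d"],
     ["a","b","e"],
     ["a","b","d"]
    ];
    // all_items: maps each item to the list of transaction indices containing it,
    // so all_items[x].length / trans.length is the support of the single item x
    var all_items = {};
    trans.forEach(function(t, idx) {
     t.forEach(function(item) {
      if (!all_items[item]) all_items[item] = [];
      all_items[item].push(idx);
     });
    });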

  2. Iterate through the transactions

    The frontier set contains a single candidate set to begin with. Each pass empties it and rebuilds it with new candidate sets. When no new candidate sets are added to the frontier set, the algorithm proceeds to the next step.

    while (F.length >0) {
     var C=[]; // the list of candidate sets

    In each pass, you test each transaction against all of the frontier sets. If the transaction contains the frontier set, you extend the frontier set with each of the items in the transaction that are not part of the frontier set. Those extensions (new candidate sets) are compared with the existing candidate sets. If a copy of a candidate set is found (they contain the same items), the original candidate set's ref_count is incremented. Otherwise, the new candidate set is added to the candidate set list. The following code uses JavaScript's array iteration methods (forEach, some, etc.) for iterating through the list and testing for list inclusion.

    for (var c=0;c<trans.length;++c) {
     var t=trans[c]; // t will be the current transaction to test
     // for each f in F
     F.forEach(
      function(f) {
       // if t contains f
       if (f.contained_in(t)) {
        f.cur_count++;
        for (newcs in extend(f,t,c,support,[])) {
         // if newcs is already in C, increment that copy's ref count; otherwise add newcs to C
         if (!C.some(function(lcs) {
          if (lcs.equals_set(newcs.set)) {
           lcs.ref_count++;
           if (!f.expected_large) lcs.expected_large=false;
           return true;
          } else {
           return false;
          }
         })) {
          C.push(newcs);
         }
        }
       }
      }
     );
    }

    Here is the extend function. For this to work, the items have to be ordered. Basically, it extends the item set (itemSet) with each item in the transaction (items) that comes after the items already in the set. It also decides whether to extend the new candidate set even further right away. The expected support of a further extension is estimated as the fraction of transactions still to be processed that contain the frontier set (the reference count minus the current count, divided by the number of transactions) multiplied by the fraction of transactions that contain each additional item. If that estimate meets the minimum support, i.e. there is a fair chance that the longer set will also turn out to be large, the algorithm extends it now rather than requiring an entire new pass through the transactions to find the new large item set. That part can actually be skipped, but it speeds the algorithm up considerably when the number of transactions is large. You'll have to read the original Apriori paper and decide for yourself whether I've implemented that portion correctly. If I didn't, this all still works but is just less efficient than it could be, since it will take additional passes to find the bigger large item sets. (A small worked example of the estimate follows the code below.)

    function extend( itemSet, items, c, support, ik ) {
     // only consider items that come after the last item already in the set (items are ordered)
     var ij = itemSet.set.length>0 || ik.length>0 ?
      items.slice(
       ik.length>0 ?
        items.indexOf( ik[ik.length-1] )+1 :
        items.indexOf( itemSet.set[itemSet.set.length-1] )+1
      ) :
      items;
     for (var i=0;i<ij.length;++i) {
      var cs=new CandidateSet();
      cs.set=itemSet.set.concat(ik,ij[i]);
      cs.ref_count=1;
      yield cs;
      // further extend?
      // expected support = F(u) * (x-c)/dbSize
      var s = (itemSet.ref_count-itemSet.cur_count)/trans.length * all_items[ij[i]].length / trans.length;
      ik.forEach(function(k){
       s *= all_items[k].length / trans.length;
      });
      cs.expected_large = s >= support;
      if ( cs.expected_large )
       for ( nc in extend( itemSet, items, c, support, ik.concat(ij[i]) ) )
        yield nc;
     }
    }
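
    To make the expected-support test concrete, here is a small worked example with hypothetical numbers, reusing the four example transactions from the top of the post:

    // Hypothetical numbers: the frontier set {a} was seen in all 4 transactions
    // (ref_count = 4) and 1 transaction has been counted so far in this pass
    // (cur_count = 1); the item "d" appears in 2 of the 4 transactions.
    var dbSize = 4;                           // trans.length
    var expected = (4 - 1)/dbSize * 2/dbSize; // (ref_count - cur_count)/dbSize * f("d") = 0.375
    var extendNow = expected >= support;      // true with support = .1, so the new candidate {a,d}
                                              // is marked expected_large and extended immediately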

    After testing each frontier set against each transaction, you consolidate a new list of frontier sets. Any candidate set that meets the minimum support is added to the large item sets. If, in addition, that candidate was not expected to be large but turned out to be, it is added to the new frontier set so it can be extended in the next pass.

    // consolidate
    F=[];
    C.forEach(
     function(cs) {
      if (cs.ref_count / trans.length >= support ) {
       L.add(cs); // add cs to Large Item Sets
       if (cs.expected_large == false ) {
        cs.cur_count=0;
        F.push(cs);
       }
      }
     });

  3. Find the association rules

    The association rules are simply taken from subsets of the large item sets. First of all, if there is a large item set that has 4 items, then all of its subsets of length 3, 2, and 1 are also large item sets. For each large item set of length n, you take each of its subsets of length n-1 as an antecedent and divide the reference count of the item set by the reference count of that subset. The result is the confidence of the association rule; the remaining item is the consequent. If the confidence meets your minimum confidence requirement, then the rule is output. Here is a not-so-efficient way that I did it, but it illustrates the process.

    LargeItemSets.prototype.yieldRules = function(confidence) {
     for ( var i=this.max_length; i>1; --i ) {
      for (cs in Iterator(this.sets[i])) {
       // find children of cs
       for (c in Iterator(this.sets[i-1])) {
        if (c[1].contained_in(cs[1].set)) {
         if ( cs[1].ref_count / c[1].ref_count >= confidence ) {
          yield new AssociationRule( c[1], cs[1] );
         }
        }
       }
      }
     }
    }
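
    Once the passes above have filled L, emitting the rules is just a matter of draining this generator. A hypothetical usage sketch (the fields of the AssociationRule object aren't shown in this post, so adapt how you consume each rule to your own definition):

    var rules = [];
    for (var rule in L.yieldRules(confidence)) {
     rules.push(rule); // collect every rule that meets the minimum confidence
    }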

There you have it. If I borked something, feel free to leave a comment and let me know.

NOTES:
[1] Firefox has a component API that can load objects written in other languages, but I chose not to spend time learning those details.
[2] An old version of my implementation that used a sequential loop (which is described in this post) to execute the Apriori Algorithm. View
[3] A newer version that uses JavaScript 1.7 Generators to implement a chunked Apriori Algorithm. View

UPDATE
This code was hosted on an old server. I created a public github repository to be the new permanent host.
http://github.com/mulicheng/Fireshare