All My Brain Where stuff from my brain lands

November 12, 2007

Implementing the Apriori Data Mining Algorithm with JavaScript

Filed under: Programming — Tags: , , , — Dennis @ 10:29 am

I’ve been working on my thesis for a little too long. I’m hopefully about finished, but that is beside the point. Part of what I’ve been working on revolves around the Apriori Data Mining algorithm. If you know what Apriori is, and you are looking for how to implement it, then this post is for you.

I’ve created a JavaScript implementation of Apriori. Admittedly, JavaScript isn’t probably the most efficient programming language to implement Apriori with; however, I was constrained to use it for my project [1]. There have been many improvements to the Apriori Algorithm since Agrawal suggested it in 1993. I chose to instead create a simple implementation of the original algorithm. First, learn how the algorithm works; second, learn how to optimize it.

A little background

The purpose of the Apriori Algorithm is to find associations between different sets of data. It is sometimes referred to as “Market Basket Analysis”. Each set of data has a number of items and is called a transaction. The output of Apriori is sets of rules that tell us how often items are contained in sets of data. Here is an example:

# each line is a set of items
a b c
a b d
a b e
a b d
# 100% of sets with a also contain b
# 25% of sets with a,b also have c
# 50% of sets with a,b also have d

The associations that Apriori finds are called Association rules. An association rule has two parts. The Antecedent is a subset of items found in sets of data. The Consequent is an item that is found in combination with the antecedent. Two terms describe the significance of the association rule. The Confidence is a percentage of data sets that contain the antecedent. The Support is a percentage of the data sets with the antecedent that also contain the consequent.

From the example above, we could find the following association rules

# consequence <- antecedent (confidence,support) a <- b (100%, 100%) # b <- a is the same c <- a b (100%, 25%) d <- a b (100%, 50%)

It's easy to see the outcome from a simple example, but in some cases, there are thousands (or more) total items to search and possibly a much larger number of combinations of those items that make up all the data sets. Apriori is a fairly efficient way to iterate through the sets of data and finds association rules depending on a particular confidence and support provided as input.

Some more definitions

  • Large Item Set

    A set of items that is contained in enough transactions such that the percentage of transactions that contain this item set is greater than or equal to the minimum support. e.g., in the sets above, [a,d] would be a large item set if the minimum support was less than or equal to 50%.

  • Candidate Set

    A set of items that is currently being tested to see if it is a large item set or not. Candidate sets have a reference count and a current count. The reference count is the number of transactions (or sets of items) in the entire data set that contain this candidate set. The current count is used for calculating a probability of this set being a large item set as the algorithm iterates and adds candidate sets to the frontier set.

  • Frontier Set

    The set of all Candidate Sets. This is reset at the end of each iteration of the algorithm to be the set of candidate sets that are expected to be large but are not yet known if they are really a large item set or not.

I have two implementations of this algorithm for my thesis. An older version was an iterative algorithm that is an almost direct implementation of the original Apriori algorithm. The newer version uses JavaScript 1.7 generators to provide a chunked implementation of that can run easier in FireFox. You are free to download either version for reference or to modify and use as you wish.

Now, the algorithm explained. I've made minor modifications to the actual code here for clarity in explanation.

  1. Initialize your data sets

    You can view my completed code for the definitions of LargeItemSets and CandidateSet

    var L=new LargeItemSets(); // list to store discovered large item sets
    // at first, the frontier set contains one candidate set that contains the empty set
    // the ref_count for this empty set is the length of the transactions (every transaction
    // contains the empty set).
    var F=[new CandidateSet()]; // list of frontier sets
    F[0].ref_count=trans.length; // trans=list of all transactions
    var passCount=0; // the number of times the algorithm has iterated.
    var support=.1; // whatever percentage you want to define for support
    var confidence=.5 // whatever percentage you want to define for confidence

  2. Iterate through the transactions

    The Frontier set contains one candidate set to begin with. Each iteration will clear and add new candidate sets to this set. When there are no new candidate sets in the frontier set, the algorithm proceeds to the next step.

    while (F.length >0) {
     var C=[]; // the list of candidate sets

    In each pass, you test each transaction against all of the frontier sets. If the the transaction contains the frontier set, then you extend the frontier set with each of the items in the transaction that are not part of the frontier set. Those extensions (new candidate sets) are compared with the existing candidate sets. If a copy of a candidate set is found, (they contain the same items), the original candidate set's ref_count is incremented. Otherwise, the new candidate set is added to the candidate set list. The following code uses JavaScript 1.7 functions for iterating through the list and testing for list inclusion (forEach, some, et al).

    for (var c=0;c<trans.length;++c) {
     var t=trans[c]; // t will be the current transaction to test
     // for each f in F
     F.forEach(
      function(f) {
      // if t contains f
       if (f.contained_in(t) ) {
        f.cur_count++;
        for (newcs in extend(f,t,c,support,[])) {
         // if cf E C inc ref count else add to C
         if (!C.some(function(lcs) {
          if (lcs.equals_set(newcs.set)) {
           lcs.ref_count++;
           if (!f.expected_large) lcs.expected_large=false;
            return true;
           } else {
            return false;
           }
          })) {
           C.push(newcs);
         }
        }
       }
      }
     );
    }

    Here is the extend function. For this to work, the items have to be ordered. Basically, it extends the item set (itemSet) by each of the items not already contained in the item set that are contained in the transaction (items). It also tests if the set should be extended by additional items by testing whether or not the reference count - current count (c) divided by the length of the transactions (the percentage of the progress iterating through the transactions) is greater than the percentage of transactions that contain the specific item. In other words, if there is a fair probability that extending this set further will result in another large item set, the algorithm does that now rather than requiring an entire new iteration through the transactions in order to find the new large item set. That part can actually be skipped, but it does speed up the algorithm considerably when the number of transactions is large. You'll have to read the original Apriori paper and determine for yourself if I've implemented that portion correctly. If I didn't, this all still works but is just less efficient than it could be since it will take additional iterations to find the bigger large item sets.

    function extend( itemSet, items, c, support, ik ) {
     var ij = itemSet.set.length>0 || ik.length>0 ?
      items.slice (
       ik.length>0 ?
       items.indexOf( ik[ik.length-1] )+1:
       items.indexOf( itemSet.set[itemSet.set.length-1] )+1
      ) :
      items;
      //ij.forEach ( function (i) {
      for (var i=0;i= support;
       if ( cs.expected_large )
        for ( nc in this.extend( itemSet, items, c, support, ik.concat(ij[i]) ) )
         yield nc;
     }
    }

    After testing each front set with each transaction, you consolidate a new list of front sets. Any candidate set that meets the minimum support is added to the Large Item sets. If, in addition, that candidate was expected to not be a large set, but it turned out to be large, then it is added to the frontier set.
    // consolidate
    F=[];
    C.forEach(
     function(cs) {
      if (cs.ref_count / trans.length >= support ) {
      L.add(cs); // add cs to Large Item Sets
      if (cs.expected_large == false ) {
       cs.cur_count=0;
       F.push(cs);
      }
     }
    });

  3. Find the association rules

    The association rules are simply taken from subsets of the large item sets. First of all, if there is a Large Item Set that has 4 items, then all the subsets of items of length ( 3, 2, and 1) are also large item sets. The association rule is found by taking the reference count of those item sets of length n-1, where n is the length of the item set, that are contained in the item set, and dividing by the reference count of the item set. The result is the confidence of the association rule. If the confidence meets your minimum confidence requirement, then the rule is output. Here is a not-so-efficient way that I did it, but it illustrates the process.

    LargeItemSets.prototype.yieldRules = function(confidence) {
     for ( var i=this.max_length; i> 1; --i ) {
      for (cs in Iterator(this.sets[i])) {
       // find children of cs
        for (c in Iterator(this.sets[i-1])) {
         if (c[1].contained_in(cs[1].set)) {
         if ( cs[1].ref_count / c[1].ref_count >= confidence ) {
          yield new AssociationRule ( c[1], cs[1] );
         }
        }
       }
      }
     }
    }

There you have it. If I borked something, feel free to leave a comment and let me know.

NOTES:
[1] FireFox has a component API that can load objects written in other languages but I chose not to spend time learning those details.
[2] An old version of my implementation that used a sequential loop (which is described in this post) to execute the Apriori Algorithm. View
[3] A newer version that uses JavaScript 1.7 Generators to implement a chunked Apriori Algorithm. View

UPDATE
This code was hosted on an old server. I created a public github repository to be the new permanent host.
http://github.com/mulicheng/Fireshare

34 Comments »

  1. im doing mca.
    can u help me about apriori algorithm source code in java how run that java program

    Comment by pallavi — March 3, 2008 @ 7:05 am

  2. Could you please help me about apriori algorithm source code in C or any language

    Comment by belal — March 3, 2008 @ 4:19 pm

  3. Could you please help me about apriori algorithm source code in C or any language, and send to my email please:

    belal_master2004@yahoo.com

    Comment by belal — March 3, 2008 @ 4:21 pm

  4. For Java, you might have a look at Weka.

    As far as providing sources in “any language”, isn’t that what this post already did? The sources are in JavaScript for this particular implementation.

    Comment by Dennis — March 3, 2008 @ 7:28 pm

  5. hello,
    I have big problem in apriori algorithm,that is I want to read data from .txt files (transaction file)
    and print out strong rules for dicisin maker
    but all codes must be in java language
    this is my project for college but there is
    no one help me.please, if you can help me send an email.
    thank you……

    Comment by lina — March 14, 2008 @ 9:58 am

  6. I couldn’t give much more help than I have here. The linked source code completely implements the algorithm. Just study it and translate it to Java.

    Comment by Dennis — March 14, 2008 @ 10:07 am

  7. thank you…

    Comment by lina — March 14, 2008 @ 10:12 am

  8. Could you please help me about apriori algorithm source code in C

    Comment by Rashmika — June 4, 2008 @ 3:33 am

  9. I suggest looking at this implementation in C:
    http://www.borgelt.net//apriori.html.

    The program works very well and the source is provided for you to look at.

    Comment by Dennis — June 9, 2008 @ 7:11 am

  10. im doing mca.
    can u help me about apriori algorithm source code in java how run that java program

    Comment by sathish.v — July 11, 2008 @ 11:37 am

  11. You might try
    Weka.

    Comment by Dennis — July 11, 2008 @ 11:38 am

  12. We have discovered new technique to mine frequent item set using graph but it only take much more memory but Time is reduced and Updating is very very easy when both Transactional Data Base & min-support count is changed………..Can I express it as IEEE paper… If any way
    there is to reduce memory consumption ….Help Me…

    Comment by S.NAGARAJAN — December 26, 2008 @ 3:49 am

  13. i need appriori algorithm to search frequently used nodes in mobile ad hoc networks .please provide this code as soon as possible which is executable in network simulator 2(NS2)

    Comment by viswampasupuleti — January 13, 2009 @ 7:57 am

  14. we have obtained a frequent pattern output from apriori algorithm ,we are stuck here can some one help us out how to generate association rule from this frequent pattern, to calculate support and confidence. We need code.please HELP!!!!

    Comment by Sridevi M.T — March 5, 2009 @ 1:22 am

  15. You’re certainly free to browse the source code I provided. It does indeed do association rules after frequent patterns are collected.

    Comment by Dennis — March 5, 2009 @ 7:33 am

  16. […] refer to the webpage to find article about it. You can download source code and project description from the codeplex […]

    Pingback by Association rules: Apriori algorithm - Data Mining source code - Microsoft User Group Винница — April 10, 2009 @ 9:26 am

  17. can you help me to generate apriori algorithm sourcecode in java.

    Comment by satya — October 1, 2009 @ 4:46 am

  18. There are already Java Apriori algorithms available. Weka for instance has an implementation.

    Comment by Dennis — October 5, 2009 @ 7:59 am

  19. Can you send me your example please
    I need for my paper
    thank
    lukman_ukI@yahoo.co.id

    Comment by lukman — April 9, 2010 @ 7:07 pm

  20. i am interested in viewing the js code you have but nothing opens when i click “view ” in notes , for old code and new code

    it only opens a blank site
    i am confused please help

    please mail me the code on my link
    will be thankful to you

    Comment by techsurge — April 10, 2010 @ 6:57 am

  21. Sorry my code repository is no longer working since I changed servers. All the relevant code for Apriori is on this page though. The rest of my code just tied the algorithm to my client software.

    Comment by Dennis — April 10, 2010 @ 9:14 am

  22. First things first, thanks for posting this. I wish I had needed/found your code 2 years ago when the your code repository was working.

    This statement might not be 100% true:
    “All the relevant code for Apriori is on this page…”

    In step #1 you mention:
    “You can view my completed code for the definitions of LargeItemSets and CandidateSet”

    So our definitions of “relevant code” may differ as I think the definition of “LargeItemSets” and “CandidateSet” might be relevant.

    That said, I’ll probably be able to look at the code you have provided and figure out the interface, but not the implementation, of your “LargeItemSets” and “CandidateSet”.

    Thanks again for your posting and if you can find a way to dig up the original full source code and make it available then, triple thanks!

    Comment by kevotheclone — June 18, 2010 @ 1:02 pm

  23. Sorry about that. I’ll give it a look through and see if there aren’t a few files I can post. The old server with my code repository is no longer running.

    Comment by Dennis — June 18, 2010 @ 1:24 pm

  24. Hey no problem, thanks for sharing what you did. I’ll subscribe to your feeds and check back on this page from time to time, but if you can’t get the files uploaded, don’t sweat it; your blog post is still helpful.

    Comment by kevotheclone — June 22, 2010 @ 11:36 am

  25. i need apriori association rule mining souce code in java. plz any one can help me.

    Comment by Poornima — February 17, 2011 @ 4:04 am

  26. You Indians need to figure things out on your own. Stop asking for everything to be handed to you.

    Comment by Buster Highmann — June 4, 2011 @ 8:14 pm

  27. Where can I find the older version ? The link seems to be broken, and here http://github.com/mulicheng/Fireshare is just the new one.

    Thank you!

    Comment by GeorgianaB — June 19, 2012 @ 6:06 am

  28. The old code site is no longer available. There isn’t anything on the old site that isn’t in the github repository though. What can’t you find?

    Comment by Dennis — June 19, 2012 @ 7:03 am

  29. Thanks so much for the post on apiori algorithm

    Comment by olahim — September 9, 2012 @ 4:10 pm

  30. hey! plz can u give in the code in javascript

    Comment by soum — December 22, 2012 @ 5:39 am

  31. How do I use this? Do I call dowork() while looping over each itemset in the dataset?

    Comment by Jake — January 31, 2013 @ 12:19 pm

  32. May I see the final output of the code?? Because im about to use the code to my project. But it’s not working.

    Comment by melvin — September 20, 2013 @ 8:18 pm

  33. All the results and the paper detailing the study are available here:
    http://www.fireshare.net/academic

    The code is a number of years out of date. While the concepts are still valid I doubt it would work out of the box in Firefox these days.

    Comment by Dennis — September 20, 2013 @ 8:22 pm

  34. Hello , I am using the MEAN stack and i use Mongoose for MongoDB i am wondering how to use the apriori algorithm with node.js and MongoDB , i have results in a database but i am new to algorithms and data structures.
    Can you help me with how to use Apriori in this context?
    Thanks

    Comment by Emir — January 18, 2017 @ 6:21 am

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by WordPress

css.php
%d bloggers like this: