What your customers may like most — Apriori Algorithm

Ritresh Girdhar
4 min read · Mar 10, 2022


Sharing my knowledge about unsupervised learning in data mining through the simplest algorithm we used to generate association rules, which determine the related grocery items customers bought together from our e-commerce application/retail stores.

Before jumping ahead, let's understand a few terms that I will be using in this article.

  • Frequent item set: a set of items that customers frequently buy together.
  • Unsupervised learning: finding patterns in data without labeled examples or prior knowledge.
  • Sampling: a statistical analysis technique for selecting and analyzing a representative subset of data points to identify patterns.
  • Noise: meaningless information, e.g. a stray value like 123 appearing in a list of grocery items.
  • Data discretization: converting a large range of data values into a smaller set of discrete buckets.
  • Pruning: shrinking the model/search space by deleting nodes or candidate item sets that cannot be frequent.

The algorithm is named Apriori because it uses prior knowledge of frequent item set properties. It applies an iterative, level-wise search in which frequent k-item sets are used to find (k+1)-item sets.

To improve the efficiency of this level-wise generation of frequent item sets, an important property called the Apriori property is used to reduce the search space.

Algorithm says (a minimal code sketch follows this list):

  • Let k = 1.
  • Generate frequent item sets of length 1.
  • Repeat until no new frequent item sets are identified:
    ◦ Generate length (k+1) candidate item sets from length-k frequent item sets.
    ◦ Prune candidate item sets containing subsets of length k that are infrequent.
    ◦ Count the support of each candidate by scanning the DB.
    ◦ Eliminate candidates that are infrequent, leaving only those that are frequent.
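To make these steps concrete, here is a minimal, self-contained Python sketch of the level-wise search. The function name, the candidate-generation shortcut (combinations over the union of frequent items), and the sample baskets are all illustrative, not taken from the repo below:

```python
from itertools import combinations

def apriori(transactions, min_support=0.6):
    """Return {frozenset: support} for all frequent item sets."""
    n = len(transactions)
    # k = 1: count singletons and keep those meeting min_support
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
    result = dict(frequent)
    k = 1
    while frequent:
        # Generate length (k+1) candidates from the frequent k-item sets
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k + 1)]
        # Prune candidates that have an infrequent length-k subset
        candidates = [c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))]
        # Count support of each surviving candidate by scanning the DB
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        # Eliminate infrequent candidates
        frequent = {c: v / n for c, v in counts.items()
                    if v / n >= min_support}
        result.update(frequent)
        k += 1
    return result

baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread", "butter"}, {"milk", "bread", "butter"}]
print(apriori(baskets, min_support=0.6))
```

Production implementations typically use the F(k-1) × F(k-1) join for candidate generation; the shortcut above is simpler but relies on the same pruning step for correctness.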

Here is the GitHub repo for reference.

Approach

  1. Sampling: divide the provided dataset into N datasets, either randomly or by some pattern or shuffling. Repeat the run on multiple random samples and compare the rules generated.
  2. Data processing: apply discretization and cleaning to the dataset to remove noisy transactions.
  3. Generate frequent item sets from the transactions with a 60% minimum support, where:

Support(A) = (Transactions containing (A))/(Total Transactions)

  4. Prune rules with confidence ≤ 75% (a usage sketch follows this list).
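A hypothetical run of steps 1 and 3, reusing the apriori() sketch from earlier (the sample baskets and the seed are made up for illustration):

```python
import random

transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread", "butter"}, {"milk", "bread", "butter"},
                {"milk", "eggs"}, {"bread", "eggs", "butter"}]

random.seed(7)
# Step 1 (sampling): draw a random subset; repeat with new seeds
sample = random.sample(transactions, k=4)
# Step 3: mine frequent item sets at 60% minimum support
frequent = apriori(sample, min_support=0.6)
for itemset, support in frequent.items():
    print(set(itemset), round(support, 2))
```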

Time complexity: generating the frequent item sets is O(N·M·w), where:

  • N: number of transactions
  • M: number of unique items
  • w: maximum number of unique items in a single transaction

Apriori principle: if an item set is frequent, then all of its subsets must also be frequent.

Apriori property: the support of an item set never exceeds the support of any of its subsets. Equivalently, if a subset is infrequent, every superset containing it is also infrequent, which is what justifies the pruning step.
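In symbols, this is the anti-monotonicity of support (a standard textbook formulation, not from the original article):

```latex
% Anti-monotonicity of support: adding items can only shrink support.
\forall A \subseteq B:\quad \mathrm{support}(B) \le \mathrm{support}(A)
```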

**There is no single ideal minimum support threshold, because (a quick comparison follows this list):**

  • A high minimum support would result in very few frequent item sets.
  • A low minimum support would result in too many frequent item sets and extra, exponential processing.
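To see the trade-off concretely, here is an illustrative comparison reusing the apriori() sketch and the transactions list from the snippets above (the thresholds are chosen arbitrarily):

```python
for threshold in (0.8, 0.3):
    itemsets = apriori(transactions, min_support=threshold)
    print(f"min_support={threshold}: {len(itemsets)} frequent item sets")
# The high threshold keeps only the most common combinations; the low
# threshold returns many more item sets and costs far more to compute.
```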

The right support threshold depends on the nature of the application. For example:

  • Medical domain: while building a recommendation engine for medicines based on patient symptoms, prefer a high value to get more accurate, closely related results.
  • E-commerce recommendation engines: prefer low values to capture more related customer data, which can help grow sales.

There are additional parameters for filtering rules and reducing processing: confidence and lift.

Confidence(A→B) = (Transactions containing (A and B))/(Transactions containing A)

Lift(A→B) = (Confidence (A→B))/(Support (B))
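Putting the two measures to work, here is a minimal sketch of rule generation with the 75% confidence pruning mentioned above. It reuses apriori() and transactions from the earlier snippets; the function name rules_from is made up for illustration:

```python
from itertools import combinations

def rules_from(itemsets, transactions, min_confidence=0.75):
    """Yield (antecedent, consequent, confidence, lift) rules."""
    n = len(transactions)
    def support(s):
        return sum(1 for t in transactions if s <= t) / n
    for itemset in itemsets:
        if len(itemset) < 2:
            continue
        # Split each frequent item set into antecedent A and consequent B
        for r in range(1, len(itemset)):
            for a in map(frozenset, combinations(itemset, r)):
                b = itemset - a
                conf = support(itemset) / support(a)
                if conf > min_confidence:  # prune rules at or below 75%
                    lift = conf / support(b)
                    yield set(a), set(b), round(conf, 2), round(lift, 2)

for rule in rules_from(apriori(transactions, 0.6), transactions):
    print(rule)
```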

Thanks for reading. Refer to the books mentioned below for more detail:

  • Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Pearson (2005).
  • Jiawei Han, Micheline Kamber, Jian Pei. Data Mining: Concepts and Techniques (The Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann (2012).
