Association analysis of purchase data - An application of Market Basket Analysis
To illustrate this application we will use the Groceries dataset, which ships with the R package arules.
Which data feed the market basket analysis? Purchase transactions collected from a store's operations (according to the arules documentation, one month of point-of-sale data from a local grocery outlet).
#install.packages("arules")
library(arules)
## Warning: package 'arules' was built under R version 3.5.1
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
# Loading the dataset
data(Groceries)
# Descriptive summary
summary(Groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55
## 16 17 18 19 20 21 22 23 24 26 27 28 29 32
## 46 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels level2 level1
## 1 frankfurter sausage meat and sausage
## 2 sausage sausage meat and sausage
## 3 liver loaf sausage meat and sausage
# Inspecting the first 5 transactions
inspect(Groceries[1:5])
## items
## [1] {citrus fruit,
## semi-finished bread,
## margarine,
## ready soups}
## [2] {tropical fruit,
## yogurt,
## coffee}
## [3] {whole milk}
## [4] {pip fruit,
## yogurt,
## cream cheese ,
## meat spreads}
## [5] {other vegetables,
## whole milk,
## condensed milk,
## long life bakery product}
# Plotting the 20 most frequent items (absolute counts and relative frequencies)
itemFrequencyPlot(Groceries,topN=20,type="absolute")
itemFrequencyPlot(Groceries,topN=20,type="relative")
Now we are ready to mine some rules! You always have to supply a minimum support and a minimum confidence.
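To make the two thresholds concrete, here is a minimal base R sketch that computes support, confidence, and lift by hand for one rule. The five baskets and the helper names `contains_all` and `support` are made up for illustration; they are not part of Groceries or of arules.

```r
# Hypothetical toy data: five baskets (not taken from Groceries)
baskets <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("bread", "butter"),
  c("milk", "yogurt"),
  c("milk", "bread", "yogurt")
)

# TRUE when every item of `items` is present in `basket`
contains_all <- function(basket, items) all(items %in% basket)

# Support: share of baskets that contain all the given items
support <- function(items) {
  mean(vapply(baskets, contains_all, logical(1), items = items))
}

lhs <- "bread"
rhs <- "milk"

supp_rule <- support(c(lhs, rhs))      # share of baskets with bread AND milk
conf_rule <- supp_rule / support(lhs)  # among baskets with bread, share that also have milk
lift_rule <- conf_rule / support(rhs)  # confidence relative to the baseline share of milk

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

A lift below 1, as in this toy case, means the right-hand side is slightly less frequent among baskets containing the left-hand side than in the data overall.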
As a first attempt we use:
Minimum support of 0.001
Minimum confidence of 0.8
We then show the first 5 rules.
# Creating rule set 1 with the apriori function
Regras1 <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Showing the first 5 rules and setting the number of digits printed
options(digits=2)
inspect(Regras1[1:5])
## lhs rhs support confidence lift
## [1] {liquor,red/blush wine} => {bottled beer} 0.0019 0.90 11.2
## [2] {curd,cereals} => {whole milk} 0.0010 0.91 3.6
## [3] {yogurt,cereals} => {whole milk} 0.0017 0.81 3.2
## [4] {butter,jam} => {whole milk} 0.0010 0.83 3.3
## [5] {soups,bottled beer} => {whole milk} 0.0011 0.92 3.6
## count
## [1] 19
## [2] 10
## [3] 17
## [4] 10
## [5] 11
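The columns of this output are tied together by simple arithmetic. As a rough check, the sketch below recomputes the figures for rule [3], {yogurt,cereals} => {whole milk}, from numbers printed earlier: 9835 transactions, 2513 occurrences of whole milk, and a rule count of 17. The confidence is taken as the rounded 0.81 shown above, so the results are approximate.

```r
# Figures copied from the output above
n_transactions   <- 9835   # number of transactions in Groceries
count_whole_milk <- 2513   # frequency of whole milk from summary(Groceries)
count_rule       <- 17     # count column of rule [3]
confidence       <- 0.81   # confidence column of rule [3] (rounded)

# support = transactions containing lhs and rhs / all transactions
support_rule <- count_rule / n_transactions

# lift = confidence / support of the right-hand side
support_rhs <- count_whole_milk / n_transactions
lift <- confidence / support_rhs

# confidence = count(lhs and rhs) / count(lhs), so count(lhs) can be recovered
lhs_count_approx <- count_rule / confidence

print(round(c(support = support_rule, lift = lift, lhs_count = lhs_count_approx), 4))
```

The recomputed support and lift round to the 0.0017 and 3.2 printed for rule [3], and the recovered count suggests that about 21 transactions contain both yogurt and cereals.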
The summary below gives us some interesting information about the rules, such as:
The number of rules generated: 410
The distribution of rules by length: most rules have 4 items
The summary of quality measures: it is worth looking at the ranges of support, confidence, and lift
The information about the mined data: the number of transactions and the minimum thresholds used
Reading a single rule, for example: 81% of the transactions that contain yogurt and cereals also contain whole milk.
Better support and confidence levels can then be set to narrow down the rules.
summary(Regras1)
## set of 410 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 29 229 140 12
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 4.0 4.0 4.3 5.0 6.0
##
## summary of quality measures:
## support confidence lift count
## Min. :0.00102 Min. :0.80 Min. : 3.1 Min. :10.0
## 1st Qu.:0.00102 1st Qu.:0.83 1st Qu.: 3.3 1st Qu.:10.0
## Median :0.00122 Median :0.85 Median : 3.6 Median :12.0
## Mean :0.00125 Mean :0.87 Mean : 4.0 Mean :12.3
## 3rd Qu.:0.00132 3rd Qu.:0.91 3rd Qu.: 4.3 3rd Qu.:13.0
## Max. :0.00315 Max. :1.00 Max. :11.2 Max. :31.0
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.001 0.8
# Creating rule set 2 and restricting the rule length to the sizes of interest
regra2 <- apriori(Groceries, parameter = list(supp=0.002, conf=0.80, minlen = 4, maxlen=6))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.002 4
## maxlen target ext
## 6 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 19
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
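The "Absolute minimum support count" lines in the two logs (9 for a support of 0.001, 19 for 0.002) can be related to the relative thresholds with a short calculation. The sketch below assumes the log simply truncates support times the number of transactions; that is inferred from the printed values, not from the arules source.

```r
n_transactions <- 9835
min_support    <- c(0.001, 0.002)

# Relative threshold expressed as a (fractional) number of transactions
threshold <- min_support * n_transactions

# Truncated value, which matches what the logs print (assumption, see above)
reported <- floor(threshold)

# Smallest whole count that actually satisfies support >= threshold
smallest_qualifying <- ceiling(threshold)

print(data.frame(min_support, threshold, reported, smallest_qualifying))
```

This is consistent with the summaries, where the minimum count is 10 for the first rule set and 20 for the second.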
summary(regra2)
## set of 8 rules
##
## rule length distribution (lhs + rhs):sizes
## 4 5
## 3 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 4.0 5.0 4.6 5.0 5.0
##
## summary of quality measures:
## support confidence lift count
## Min. :0.00203 Min. :0.80 Min. :3.2 Min. :20.0
## 1st Qu.:0.00203 1st Qu.:0.81 1st Qu.:3.2 1st Qu.:20.0
## Median :0.00224 Median :0.82 Median :3.3 Median :22.0
## Mean :0.00236 Mean :0.83 Mean :3.6 Mean :23.2
## 3rd Qu.:0.00247 3rd Qu.:0.84 3rd Qu.:4.1 3rd Qu.:24.2
## Max. :0.00315 Max. :0.89 Max. :4.6 Max. :31.0
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.002 0.8
inspect(regra2)
## lhs rhs support confidence lift count
## [1] {tropical fruit,
## grapes,
## whole milk} => {other vegetables} 0.0020 0.80 4.1 20
## [2] {other vegetables,
## curd,
## domestic eggs} => {whole milk} 0.0028 0.82 3.2 28
## [3] {pork,
## other vegetables,
## butter} => {whole milk} 0.0022 0.85 3.3 22
## [4] {root vegetables,
## other vegetables,
## yogurt,
## fruit/vegetable juice} => {whole milk} 0.0020 0.83 3.3 20
## [5] {root vegetables,
## whole milk,
## yogurt,
## fruit/vegetable juice} => {other vegetables} 0.0020 0.80 4.1 20
## [6] {citrus fruit,
## tropical fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.0032 0.89 4.6 31
## [7] {citrus fruit,
## root vegetables,
## other vegetables,
## yogurt} => {whole milk} 0.0023 0.82 3.2 23
## [8] {tropical fruit,
## root vegetables,
## yogurt,
## rolls/buns} => {whole milk} 0.0022 0.81 3.2 22
inspect(sort(regra2, by = "lift"))
## lhs rhs support confidence lift count
## [1] {citrus fruit,
## tropical fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.0032 0.89 4.6 31
## [2] {tropical fruit,
## grapes,
## whole milk} => {other vegetables} 0.0020 0.80 4.1 20
## [3] {root vegetables,
## whole milk,
## yogurt,
## fruit/vegetable juice} => {other vegetables} 0.0020 0.80 4.1 20
## [4] {pork,
## other vegetables,
## butter} => {whole milk} 0.0022 0.85 3.3 22
## [5] {root vegetables,
## other vegetables,
## yogurt,
## fruit/vegetable juice} => {whole milk} 0.0020 0.83 3.3 20
## [6] {other vegetables,
## curd,
## domestic eggs} => {whole milk} 0.0028 0.82 3.2 28
## [7] {citrus fruit,
## root vegetables,
## other vegetables,
## yogurt} => {whole milk} 0.0023 0.82 3.2 23
## [8] {tropical fruit,
## root vegetables,
## yogurt,
## rolls/buns} => {whole milk} 0.0022 0.81 3.2 22
regra3 <- apriori( Groceries, parameter = list(supp = 0.002, conf = 0.7, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.002 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 19
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [94 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(regra3)
## set of 94 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5
## 22 59 13
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 4.0 4.0 3.9 4.0 5.0
##
## summary of quality measures:
## support confidence lift count
## Min. :0.0020 Min. :0.70 Min. :2.7 Min. :20
## 1st Qu.:0.0021 1st Qu.:0.71 1st Qu.:2.8 1st Qu.:21
## Median :0.0024 Median :0.74 Median :3.0 Median :24
## Mean :0.0026 Mean :0.75 Mean :3.2 Mean :26
## 3rd Qu.:0.0027 3rd Qu.:0.77 3rd Qu.:3.5 3rd Qu.:27
## Max. :0.0057 Max. :0.89 Max. :4.6 Max. :56
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.002 0.7
inspect(sort(regra3[1:20], decreasing = TRUE, by = "lift"))
## lhs rhs support confidence lift count
## [1] {whipped/sour cream,
## soft cheese} => {other vegetables} 0.0022 0.73 3.8 22
## [2] {root vegetables,
## soft cheese} => {other vegetables} 0.0024 0.73 3.8 24
## [3] {citrus fruit,
## herbs} => {other vegetables} 0.0021 0.72 3.7 21
## [4] {root vegetables,
## baking powder} => {other vegetables} 0.0025 0.71 3.7 25
## [5] {root vegetables,
## rice} => {other vegetables} 0.0022 0.71 3.7 22
## [6] {tropical fruit,
## herbs} => {whole milk} 0.0023 0.82 3.2 23
## [7] {hamburger meat,
## curd} => {whole milk} 0.0025 0.81 3.2 25
## [8] {herbs,
## rolls/buns} => {whole milk} 0.0024 0.80 3.1 24
## [9] {root vegetables,
## rice} => {whole milk} 0.0024 0.77 3.0 24
## [10] {butter milk,
## whipped/sour cream} => {whole milk} 0.0029 0.76 3.0 29
## [11] {onions,
## butter} => {whole milk} 0.0031 0.75 2.9 30
## [12] {butter,
## soft cheese} => {whole milk} 0.0020 0.74 2.9 20
## [13] {cream cheese ,
## sugar} => {whole milk} 0.0020 0.74 2.9 20
## [14] {butter,
## curd} => {whole milk} 0.0049 0.72 2.8 48
## [15] {yogurt,
## specialty cheese} => {whole milk} 0.0020 0.71 2.8 20
## [16] {dessert,
## butter milk} => {whole milk} 0.0020 0.71 2.8 20
## [17] {domestic eggs,
## sugar} => {whole milk} 0.0036 0.71 2.8 35
## [18] {yogurt,
## baking powder} => {whole milk} 0.0033 0.71 2.8 32
## [19] {whipped/sour cream,
## sliced cheese} => {whole milk} 0.0027 0.71 2.8 27
## [20] {butter,
## coffee} => {whole milk} 0.0034 0.70 2.7 33
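Note that the call above takes the first 20 rules and then sorts them, which is why the highest lift listed is 3.8 while the summary reports a maximum of 4.6. If the goal is the 20 rules with the highest lift overall, the sort has to come first. The difference is sketched below with a plain numeric vector of hypothetical lift values.

```r
# Hypothetical lift values, in the order the rules were generated
lift <- c(2.7, 3.1, 4.6, 2.9, 3.8, 3.0)
n <- 3

# Subset first, then sort: only the first n values are ever considered
subset_then_sort <- sort(lift[1:n], decreasing = TRUE)

# Sort first, then subset: the n largest values overall
sort_then_subset <- sort(lift, decreasing = TRUE)[1:n]

print(subset_then_sort)
print(sort_then_subset)

# With arules, the second variant would be written as:
#   inspect(sort(regra3, by = "lift")[1:20])
```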
regra_beef <- subset(regra3, items %in% "beef")
inspect(regra_beef)
## lhs rhs support confidence lift count
## [1] {beef,
## other vegetables,
## domestic eggs} => {whole milk} 0.0025 0.76 3.0 25
## [2] {beef,
## tropical fruit,
## root vegetables} => {other vegetables} 0.0027 0.73 3.8 27
## [3] {beef,
## tropical fruit,
## rolls/buns} => {whole milk} 0.0021 0.78 3.0 21
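Filtering with `items %in% "beef"` keeps every rule in which beef appears on either side; arules also accepts `lhs` or `rhs` in place of `items` to restrict the match to one side. The base R sketch below mimics the three variants on a small hand-written list of rules. The list and the helper `has_item` are hypothetical and only illustrate the logic.

```r
# Hypothetical rules as plain lists (illustration only, not arules objects)
rules <- list(
  list(lhs = c("beef", "other vegetables", "domestic eggs"), rhs = "whole milk"),
  list(lhs = c("root vegetables", "butter"),                 rhs = "beef"),
  list(lhs = c("butter", "curd"),                            rhs = "whole milk")
)

# Does `item` occur in the chosen side of the rule?
has_item <- function(rule, item, side = c("items", "lhs", "rhs")) {
  side <- match.arg(side)
  pool <- switch(side,
    items = c(rule$lhs, rule$rhs),  # either side
    lhs   = rule$lhs,
    rhs   = rule$rhs
  )
  item %in% pool
}

any_side <- vapply(rules, has_item, logical(1), item = "beef")
lhs_only <- vapply(rules, has_item, logical(1), item = "beef", side = "lhs")
rhs_only <- vapply(rules, has_item, logical(1), item = "beef", side = "rhs")

print(rbind(any_side, lhs_only, rhs_only))
```

All three rules printed above have {whole milk} or {other vegetables} on the right-hand side, so for regra3 the `items` filter behaves like an `lhs` filter.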
Segmentation 1 - Which items lead customers to also buy a given product (beef)?
regra3_seg1 <- apriori(Groceries, parameter = list(supp = 0.002, conf = 0.2),
                       appearance = list(default = "lhs", rhs = "beef"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.002 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 19
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [16 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(regra3_seg1)
## lhs rhs support confidence lift count
## [1] {pork,
## root vegetables} => {beef} 0.0027 0.20 3.8 27
## [2] {root vegetables,
## butter} => {beef} 0.0029 0.23 4.4 29
## [3] {root vegetables,
## newspapers} => {beef} 0.0027 0.24 4.6 27
## [4] {citrus fruit,
## root vegetables} => {beef} 0.0039 0.22 4.2 38
## [5] {root vegetables,
## soda} => {beef} 0.0040 0.21 4.1 39
## [6] {root vegetables,
## rolls/buns} => {beef} 0.0050 0.21 3.9 49
## [7] {pork,
## other vegetables,
## whole milk} => {beef} 0.0023 0.23 4.4 23
## [8] {root vegetables,
## whole milk,
## butter} => {beef} 0.0020 0.25 4.7 20
## [9] {other vegetables,
## whole milk,
## domestic eggs} => {beef} 0.0025 0.21 3.9 25
## [10] {citrus fruit,
## root vegetables,
## other vegetables} => {beef} 0.0021 0.21 3.9 21
## [11] {citrus fruit,
## root vegetables,
## whole milk} => {beef} 0.0022 0.24 4.7 22
## [12] {tropical fruit,
## root vegetables,
## other vegetables} => {beef} 0.0027 0.22 4.3 27
## [13] {tropical fruit,
## root vegetables,
## whole milk} => {beef} 0.0025 0.21 4.0 25
## [14] {root vegetables,
## other vegetables,
## soda} => {beef} 0.0020 0.25 4.7 20
## [15] {root vegetables,
## other vegetables,
## rolls/buns} => {beef} 0.0028 0.23 4.4 28
## [16] {root vegetables,
## whole milk,
## rolls/buns} => {beef} 0.0028 0.22 4.3 28
Segmentation 2 - Which items do customers also buy, given that they bought a given product (beef)?
regra3_seg2 <- apriori(Groceries, parameter = list(supp = 0.002, conf = 0.2),
                       appearance = list(default = "rhs", lhs = "beef"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.002 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 19
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [6 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(regra3_seg2)
## lhs rhs support confidence lift count
## [1] {} => {whole milk} 0.256 0.26 1.0 2513
## [2] {beef} => {root vegetables} 0.017 0.33 3.0 171
## [3] {beef} => {yogurt} 0.012 0.22 1.6 115
## [4] {beef} => {rolls/buns} 0.014 0.26 1.4 134
## [5] {beef} => {other vegetables} 0.020 0.38 1.9 194
## [6] {beef} => {whole milk} 0.021 0.41 1.6 209
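Rule [1] has an empty left-hand side: its confidence is just the overall support of whole milk and its lift is 1, so it works as a baseline rather than a finding about beef. It appears because minlen defaults to 1; passing minlen = 2 in the parameter list should leave it out. For the remaining rules, dividing confidence by lift recovers the baseline support of each right-hand side, which shows why root vegetables has the highest lift even though whole milk has the highest confidence. The sketch below uses the rounded values printed above, so the results are approximate.

```r
# Values copied from rules [2] to [6] above (rounded as printed)
beef_rules <- data.frame(
  rhs        = c("root vegetables", "yogurt", "rolls/buns",
                 "other vegetables", "whole milk"),
  confidence = c(0.33, 0.22, 0.26, 0.38, 0.41),
  lift       = c(3.0, 1.6, 1.4, 1.9, 1.6),
  stringsAsFactors = FALSE
)

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
beef_rules$baseline_support <- beef_rules$confidence / beef_rules$lift

# Order by lift, strongest association with beef first
beef_rules <- beef_rules[order(-beef_rules$lift), ]
print(beef_rules, row.names = FALSE)
```

Whole milk is in roughly a quarter of all baskets anyway, so a confidence of 0.41 is a modest gain, while root vegetables goes from a baseline of about 0.11 to 0.33 among beef buyers.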
#install.packages("arulesViz")
library(arulesViz)
## Warning: package 'arulesViz' was built under R version 3.5.2
## Loading required package: grid
# Graph visualization of the rules; an interactive version is also available
plot(regra3_seg2, method = "graph")
plot(regra3_seg2, method = "graph", engine = "interactive")