Transaction and Customer Data Analysis
IMPORT LIBRARIES
TRANSACTION DATASET
In [49]: # Transaction
df_transaction = pd.read_excel("QVI_transaction_data.xlsx")
df_transaction.head()
0  43390  1  1000     1    5  Natural Chip Compny SeaSalt175g
1  43599  1  1307   348   66  CCs Nacho Cheese 175g
2  43605  1  1343   383   61  Smiths Crinkle Cut Chips Chicken 170g
3  43329  2  2373   974   69  Smiths Chip Thinly S/Cream&Onion 175g
4  43330  2  2426  1038  108  Kettle Tortilla ChpsHny&Jlpno Chili 150g
CUSTOMER DATASET
In [52]: # Customer
df_customer = pd.read_csv("QVI_purchase_behaviour.csv")
df_customer.head()
MERGE DATASETS
0  43390  1  1000     1    5  Natural Chip Compny SeaSalt175g           2   6.0  SINGLES/
1  43599  1  1307   348   66  CCs Nacho Cheese 175g                     3   6.3  SINGLES/
2  43605  1  1343   383   61  Smiths Crinkle Cut Chips Chicken 170g     2   2.9  SINGLES/
3  43329  2  2373   974   69  Smiths Chip Thinly S/Cream&Onion 175g     5  15.0  SINGLES/
4  43330  2  2426  1038  108  Kettle Tortilla ChpsHny&Jlpno Chili 150g  3  13.8  SINGLES/
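The cell that performs the merge is not included in this export. The following is a minimal sketch of one way to do it, assuming both tables share a loyalty card column. The column names (`Card No.`, `Group`, `Subscription`) are borrowed from later cells, and the rows are illustrative values, not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for the two tables (illustrative values only).
df_transaction = pd.DataFrame({
    "Card No.": [1, 1, 2, 3],
    "Quantity": [2, 3, 2, 5],
    "Sales": [6.0, 6.3, 2.9, 15.0],
})
df_customer = pd.DataFrame({
    "Card No.": [1, 2, 3],
    "Group": ["YOUNG SINGLES/COUPLES", "OLDER FAMILIES", "YOUNG FAMILIES"],
    "Subscription": ["Premium", "Budget", "Mainstream"],
})

# A left join keeps every transaction; validate= raises an error if a card
# number appears more than once in the customer table.
df = df_transaction.merge(
    df_customer, on="Card No.", how="left", validate="many_to_one"
)

# Transactions whose card number has no customer record would show up as NaN.
unmatched = df["Group"].isna().sum()
print(df)
print("Unmatched transactions:", unmatched)
```

Checking `unmatched` right after the merge is a cheap way to confirm that no transaction lost its customer attributes.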
EXAMINE DATATYPES
In [55]: df.dtypes
In [56]: # 15 Dates
date_offsets = df["Date"].to_list()
base_date = pd.Timestamp("1899-12-30") # Start Date
df["Date"] = [base_date + pd.DateOffset(date_offset) for date_offset in date_offsets]
df["Date"][0:15]
Out[56]: 0 2018-10-17
1 2019-05-14
2 2019-05-20
3 2018-08-17
4 2018-08-18
5 2019-05-19
6 2019-05-16
7 2019-05-16
8 2018-08-20
9 2018-08-18
10 2019-05-17
11 2018-08-20
12 2019-05-18
13 2018-08-17
14 2019-05-15
Name: Date, dtype: datetime64[ns]
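The conversion above treats each integer as a number of days since 1899-12-30, the usual epoch for Excel serial dates. As a hedged alternative sketch, the same result can be obtained in one vectorized call with `pd.to_datetime`. The five serial numbers below are the ones shown in the first rows of the transaction table.

```python
import pandas as pd

# Excel-style serial dates taken from the first five transaction rows.
serials = pd.Series([43390, 43599, 43605, 43329, 43330], name="Date")

# unit="D" interprets the integers as days; origin sets the day-zero date.
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")

print(dates.dt.strftime("%Y-%m-%d").tolist())
# ['2018-10-17', '2019-05-14', '2019-05-20', '2018-08-17', '2018-08-18']
```

This avoids building a Python list of offsets row by row, which matters little for correctness but is noticeably faster on a few hundred thousand rows.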
PRODUCTS SUMMARY
FREQUENCY OF PRODUCTS
In [70]: word_counts = {}

def count_words(line):
    # Tally every word of one tokenised product name into word_counts
    for word in line:
        if word not in word_counts:
            word_counts[word] = 1
        else:
            word_counts[word] += 1

split_prods.apply(lambda line: count_words(line))
print(pd.Series(word_counts).sort_values(ascending=False))
175g 60561
Chips 49770
150g 41633
Kettle 41288
& 35565
...
Whlegrn 1432
Pc 1431
NCC 1419
Garden 1419
Fries 1418
Length: 220, dtype: int64
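The definition of `split_prods` is not visible in this export. The sketch below assumes it is the product name split on whitespace, and shows an equivalent word count that uses `explode` and `value_counts` instead of a manual dictionary. The product names are the five shown in the first table.

```python
import pandas as pd

products = pd.Series([
    "Natural Chip Compny SeaSalt175g",
    "CCs Nacho Cheese 175g",
    "Smiths Crinkle Cut Chips Chicken 170g",
    "Smiths Chip Thinly S/Cream&Onion 175g",
    "Kettle Tortilla ChpsHny&Jlpno Chili 150g",
])

# Assumption: split_prods holds one list of words per product name.
split_prods = products.str.split()

# explode() puts each word on its own row; value_counts() tallies them
# in descending order of frequency.
word_counts = split_prods.explode().value_counts()
print(word_counts.head())
```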
In [71]: df = df[~df["Product"].str.contains(r"[Ss]alsa")]
df.head()
   Date        Store Id  Card No.  Transaction Id  Product Id  Product                                   Quantity  Sales  Group
0  2018-10-17  1         1000      1               5           Natural Chip Compny SeaSalt175g           2         6.0    SINGLES/
2  2019-05-20  1         1343      383             61          Smiths Crinkle Cut Chips Chicken 170g     2         2.9    SINGLES/
3  2018-08-17  2         2373      974             69          Smiths Chip Thinly S/Cream&Onion 175g     5         15.0   SINGLES/
4  2018-08-18  2         2426      1038            108         Kettle Tortilla ChpsHny&Jlpno Chili 150g  3         13.8   SINGLES/
In [72]: df.describe()
Out[72]:
      Date                           Store Id    Card No.      Transaction Id  Product Id
mean  2018-12-30 01:19:01.211467520  135.051098  1.355310e+05  1.351311e+05    56.351789
min   2018-07-01 00:00:00            1.000000    1.000000e+03  1.000000e+00    1.000000
25%   2018-09-30 00:00:00            70.000000   7.001500e+04  6.756925e+04    26.000000
50%   2018-12-30 00:00:00            130.000000  1.303670e+05  1.351830e+05    53.000000
75%   2019-03-31 00:00:00            203.000000  2.030840e+05  2.026538e+05    87.000000
max   2019-06-30 00:00:00            272.000000  2.373711e+06  2.415841e+06    114.000000
        Date        Store Id  Card No.  Transaction Id  Product Id  Product                              Quantity  Sales  Group
69762   2018-08-19  226       226000    226201          4           Dorito Corn Chp Supreme 380g         200       650.0  OLDE
69763   2019-05-20  226       226000    226210          4           Dorito Corn Chp Supreme 380g         200       650.0  OLDE
32226   2018-08-19  62        62057     58117           51          Doritos Mexicana 170g                5         22.0   OLDE
69541   2019-05-16  86        86089     84699           38          Infuzions Mango Chutny Papadums 70g  5         12.0   OLDE
238227  2019-05-19  180       180111    181705          53          RRD Sweet Chilli & Sour Cream 165g   5         15.0   SINGLE
Out[76]: 2
IT LOOKS LIKE THIS CUSTOMER HAS ONLY HAD THE TWO TRANSACTIONS OVER THE
YEAR AND IS NOT AN ORDINARY RETAIL CUSTOMER. THE CUSTOMER MIGHT BE
BUYING CHIPS FOR COMMERCIAL PURPOSES INSTEAD. WE WILL REMOVE THIS LOYALTY
CARD NUMBER FROM FURTHER ANALYSIS.
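The cells that identify and drop this customer are not shown in the export. A minimal sketch of the idea follows, assuming the outlier is found through the maximum `Quantity`. The rows reuse the values visible in the table above, and the variable names `bulk_cards` and `df_clean` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Card No.": [226000, 226000, 62057, 86089],
    "Quantity": [200, 200, 5, 5],
    "Sales": [650.0, 650.0, 22.0, 12.0],
})

# Loyalty cards responsible for the largest quantity in the data.
bulk_cards = df.loc[df["Quantity"] == df["Quantity"].max(), "Card No."].unique()

# How many transactions those cards made in total.
n_txn = df["Card No."].isin(bulk_cards).sum()
print("Transactions by bulk buyer:", n_txn)

# Drop every transaction from those cards, not only the outlier rows.
df_clean = df[~df["Card No."].isin(bulk_cards)].reset_index(drop=True)
print(df_clean)
```

Filtering on the card number rather than on `Quantity` removes the customer entirely, which matches the stated intent of excluding the loyalty card from further analysis.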
Out[78]: Date
2018-12-24 865
2018-12-23 853
2018-12-22 840
2018-12-19 839
2018-12-20 808
...
2019-06-24 612
2018-10-18 611
2018-11-25 610
2018-09-22 609
2019-06-13 607
Name: count, Length: 364, dtype: int64
INSTEAD OF 365, THE DATE COLUMN ONLY HAS 364 UNIQUE VALUES. 1 IS MISSING.
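One way to locate a missing day is to compare the observed dates against a complete daily calendar. The sketch below uses a toy five-day window with one day deliberately left out; the same two calls would apply to the full year of data.

```python
import pandas as pd

# Toy data: 2018-07-03 is deliberately absent.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2018-07-01", "2018-07-02", "2018-07-02", "2018-07-04", "2018-07-05"]
    )
})

# Every calendar day between the first and last observed date.
full_range = pd.date_range(df["Date"].min(), df["Date"].max(), freq="D")

# Days in the calendar that never appear in the data.
missing = full_range.difference(pd.DatetimeIndex(df["Date"]))
print(missing)
```

Note that this only detects gaps between the first and last observed dates. Here that is sufficient, because the summary statistics show the data starting on 2018-07-01 and ending on 2019-06-30.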
   Date        Store Id  Card No.  Transaction Id  Product Id  Product                                Quantity  Sales  Group
0  2018-07-01  47.0      47142.0   42540.0         14.0        Smiths Crnkle Chip Orgnl Big Bag 380g  2.0       11.8   SINGLES/C
1  2018-07-01  55.0      55073.0   48884.0         99.0        Pringles Sthrn FriedChicken 134g       2.0       7.4    SINGLES/C
3  2018-07-01  58.0      58351.0   54374.0         102.0       Kettle Mozzarella Basil & Pesto 175g   2.0       10.8   SINGLES/C
4  2018-07-01  68.0      68193.0   65598.0         44.0        Thins Chips Light& Tangy 175g          2.0       6.6    SINGLES/C
0 175.0
1 175.0
2 170.0
3 175.0
4 150.0
...
264831 175.0
264832 175.0
264833 170.0
264834 150.0
264835 175.0
Name: 0, Length: 246740, dtype: float64
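The cell that builds `pack_size` is not shown. The series name `0` in the output suggests a regular-expression extraction, so the sketch below assumes the pack size is the number immediately before a `g` in the product name. The regular expression is an assumption, not the author's confirmed pattern.

```python
import pandas as pd

products = pd.Series([
    "Natural Chip Compny SeaSalt175g",
    "CCs Nacho Cheese 175g",
    "Smiths Crinkle Cut Chips Chicken 170g",
    "Smiths Chip Thinly S/Cream&Onion 175g",
    "Kettle Tortilla ChpsHny&Jlpno Chili 150g",
])

# Capture the digits that directly precede "g" or "G"; extract() returns a
# DataFrame with one column per capture group, so column 0 holds the size.
pack_size = products.str.extract(r"(\d+)\s*[gG]")[0].astype(float)
print(pack_size.tolist())
# [175.0, 175.0, 170.0, 175.0, 150.0]
```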
In [91]: pack_size.describe()
In [94]: pack_size.plot.hist()
plt.show()
In [96]: df["Product"].str.split().str[0].value_counts()
Out[96]: Product
Kettle 41288
Smiths 27390
Pringles 25102
Doritos 22041
Thins 14075
RRD 11894
Infuzions 11057
WW 10320
Cobs 9693
Tostitos 9471
Twisties 9454
Tyrrells 6442
Grain 6272
Natural 6050
Cheezels 4603
CCs 4551
Red 4427
Dorito 3183
Infzns 3144
Smith 2963
Cheetos 2927
Snbts 1576
Burger 1564
Woolworths 1516
GrnWves 1468
Sunbites 1432
NCC 1419
French 1418
Name: count, dtype: int64
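Several first words in this list look like spelling variants of the same brand (for example `Dorito` and `Doritos`, `Smith` and `Smiths`, `Infzns` and `Infuzions`, `Snbts` and `Sunbites`). The sketch below shows one possible way to merge such variants before counting. The alias mapping is a guess based only on the names above and would need to be checked against the actual products.

```python
import pandas as pd

products = pd.Series([
    "Dorito Corn Chp Supreme 380g",
    "Doritos Mexicana 170g",
    "Smiths Crinkle Cut Chips Chicken 170g",
    "Infuzions Mango Chutny Papadums 70g",
    "Kettle Tortilla ChpsHny&Jlpno Chili 150g",
])

# Hypothetical alias table: variant spelling -> canonical brand name.
brand_aliases = {
    "Dorito": "Doritos",
    "Smith": "Smiths",
    "Infzns": "Infuzions",
    "Snbts": "Sunbites",
}

# Take the first word as the brand, then collapse known variants.
brand = products.str.split().str[0].replace(brand_aliases)
print(brand.value_counts())
```

Without this step, a brand's sales are split across its spellings, which can understate it in any per-brand ranking.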
DATA ANALYSIS
---> WHO SPENDS THE MOST ON CHIPS (TOTAL SALES), DESCRIBING CUSTOMERS BY
LIFESTAGE AND HOW PREMIUM THEIR GENERAL PURCHASING BEHAVIOUR IS
---> HOW MANY CUSTOMERS ARE IN EACH SEGMENT
---> HOW MANY CHIPS ARE BOUGHT PER CUSTOMER BY SEGMENT
---> WHAT IS THE AVERAGE CHIPS PRICE BY CUSTOMER SEGMENT
In [ ]: # 1/ WHO SPENDS THE MOST ON CHIPS (TOTAL SALES), DESCRIBING CUSTOMERS BY LIFESTA
most_shopping = df.groupby(["Group", "Subscription"])["Sales"].agg(["sum"]).sort
most_shopping
Out[ ]: sum
Group Subscription
Mainstream 124648.50
Premium 123537.55
Mainstream 15979.70
Premium 10760.80
Group Subscription
Budget 4849
Premium 4682
Mainstream 830
Premium 575
Out[ ]: Date
Out[113… mean
Group Subscription
Budget 4.665799
Premium 4.662931
Budget 4.493549
Mainstream 4.449534
Premium 3.536950
Mainstream 3.511939
Budget 2.597976
Premium 2.587826
Premium 2.359677
Budget 2.350699
1/ ALTHOUGH OLDER FAMILIES ARE NOT THE LARGEST SEGMENT BY NUMBER OF CUSTOMERS,
THEY HAVE THE HIGHEST PURCHASE FREQUENCY, WHICH CONTRIBUTES TO THEIR HIGH TOTAL
SALES.
2/ OLDER FAMILIES, FOLLOWED BY YOUNG FAMILIES, HAVE THE HIGHEST AVERAGE
QUANTITY OF CHIPS BOUGHT PER PURCHASE.
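The cells behind the customer counts and per-customer quantities are not included in the export. A minimal sketch of how these segment metrics could be computed together is given below, assuming the column names `Card No.`, `Group`, `Subscription`, `Quantity` and `Sales`. The rows are toy values chosen for easy checking.

```python
import pandas as pd

df = pd.DataFrame({
    "Card No.": [1, 1, 2, 3, 3, 4],
    "Group": ["OLDER FAMILIES"] * 3 + ["YOUNG SINGLES/COUPLES"] * 3,
    "Subscription": ["Budget"] * 3 + ["Mainstream"] * 3,
    "Quantity": [2, 2, 3, 1, 2, 2],
    "Sales": [6.0, 5.0, 9.0, 3.0, 7.0, 6.0],
})

segments = df.groupby(["Group", "Subscription"])

summary = pd.DataFrame({
    "total_sales": segments["Sales"].sum(),
    # nunique() counts each loyalty card once, however often it shops.
    "customers": segments["Card No."].nunique(),
    "units": segments["Quantity"].sum(),
})
summary["units_per_customer"] = summary["units"] / summary["customers"]
print(summary)
```

Keeping the customer count and the per-customer quantity side by side makes it easier to see whether a segment's total sales come from many customers or from heavier buying per customer.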
Out[116… mean
Group Subscription
Budget 7.108442
Budget 6.663023
THE Mainstream CATEGORY OF "YOUNG & MIDAGE SINGLES/COUPLES" HAS THE
HIGHEST SPENDING ON CHIPS PER PURCHASE, AND THE DIFFERENCE FROM THE
NON-Mainstream "YOUNG & MIDAGE SINGLES/COUPLES" IS STATISTICALLY SIGNIFICANT.
T-TEST
1.834645908180742e-237
Out[118… np.True_
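The test cell itself is missing from the export, so the exact variant used is unknown. The following sketch shows one common choice, a two-sample Welch t-test from SciPy, on two small made-up samples of per-purchase spending. The 0.05 threshold is also an assumption.

```python
from scipy import stats

# Made-up per-purchase spending for the two groups being compared.
mainstream = [4.2, 4.0, 4.4, 4.1, 4.3, 4.2]
non_mainstream = [3.6, 3.7, 3.5, 3.8, 3.6, 3.7]

# equal_var=False selects Welch's t-test, which does not assume that the
# two groups have the same variance.
t_stat, p_value = stats.ttest_ind(mainstream, non_mainstream, equal_var=False)

significant = p_value < 0.05
print(p_value)
print(significant)
```

A very small p-value, such as the one reported above, indicates that a difference this large would be unlikely if the two groups had the same mean.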
Out[119… Brand
Group Subscription
Mainstream Kettle
Budget Kettle
Mainstream Kettle
Budget Kettle
Mainstream Kettle
Budget Kettle
Mainstream Kettle
Budget Kettle
Mainstream Kettle
EVERY SEGMENT HAD "Kettle" AS THE MOST PURCHASED BRAND. EVERY SEGMENT
EXCEPT "YOUNG SINGLES/COUPLES Mainstream" HAD "Smiths" AS ITS SECOND
MOST PURCHASED BRAND; "YOUNG SINGLES/COUPLES Mainstream" HAD "Doritos"
INSTEAD.
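The ranking behind this statement can be reproduced with a grouped `value_counts`. The sketch below assumes a `Brand` column already exists and uses toy transaction counts; the helper column name `n` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Group": ["OLDER FAMILIES"] * 6 + ["YOUNG SINGLES/COUPLES"] * 6,
    "Subscription": ["Budget"] * 6 + ["Mainstream"] * 6,
    "Brand": ["Kettle", "Kettle", "Kettle", "Smiths", "Smiths", "Doritos",
              "Kettle", "Kettle", "Kettle", "Doritos", "Doritos", "Smiths"],
})

# Count purchases per brand within each segment; the result is sorted
# from most to least frequent inside every segment.
counts = (
    df.groupby(["Group", "Subscription"])["Brand"]
    .value_counts()
    .rename("n")
)

# Keep the two most purchased brands of each segment.
top2 = counts.groupby(level=["Group", "Subscription"]).head(2).reset_index()
print(top2)
```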
THE MOST FREQUENTLY PURCHASED CHIP SIZE IS 175g, FOLLOWED BY 150g, FOR ALL
SEGMENTS.
Out[125… 0
Group Subscription
Budget 9.076773
Premium 9.071717
Premium 8.716013
Mainstream 8.638361
Premium 6.769543
Mainstream 6.712021
Premium 6.103358
Budget 6.026459
Budget 4.821527
Premium 4.815652
Premium 4.264113
Budget 4.250069
In [ ]: # Calculate "Unit Price" only where "Quantity" is non-zero and both columns are
df["Unit Price"] = df["Sales"] / df["Quantity"].replace(0, pd.NA)
# Group by "Group" and "Subscription" and calculate the mean of "Unit Price"
chips_segment = df.groupby(["Group", "Subscription"], dropna=False)["Unit Price"
# Display the resulting DataFrame
chips_segment
Group Subscription
Premium 3.920942
Mainstream 3.916133
Budget 3.882096
Budget 3.760737
Budget 3.657366
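Because the cell above is cut off at the page edge, its final chained calls are not visible. The following is a self-contained sketch of the same idea under stated assumptions: zero quantities are replaced with `np.nan` so the division never divides by zero, and the result is sorted in descending order. The rows are toy values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Group": ["YOUNG SINGLES/COUPLES", "YOUNG SINGLES/COUPLES",
              "OLDER FAMILIES", "OLDER FAMILIES"],
    "Subscription": ["Mainstream", "Mainstream", "Budget", "Budget"],
    "Quantity": [2, 0, 5, 3],
    "Sales": [6.0, 0.0, 15.0, 13.5],
})

# Rows with zero quantity get NaN as unit price instead of inf or an error.
df["Unit Price"] = df["Sales"] / df["Quantity"].replace(0, np.nan)

# mean() skips NaN by default, so zero-quantity rows do not affect the average.
chips_segment = (
    df.groupby(["Group", "Subscription"], dropna=False)["Unit Price"]
    .mean()
    .sort_values(ascending=False)
)
print(chips_segment)
```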
RECOMMENDATIONS
4/ GENERAL
- All segments have "Kettle" as the most frequently purchased brand, and 175g
(regardless of brand), followed by 150g, as the preferred chip size.
- When promoting chips to all segments, it is worth taking advantage of these
two points.