在一个真实的商业场景中,经营者通常需要:
了解他当时所面对的经营情况,既:对比历史,了解现在分析原因,既:洞察真相制定策略,提高增长,既:规划和执行检查结果,既:反馈循环 这四步需要循环迭代以使商业经营不断纠错、不断适应新的经济环境的变化。 但是在大数据分析、模型构建等技术发展时代,我们通常会借助人工智能和数据分析手段来辅助完成。
本文开始以python语言进行数据驱动、客户分析的实践,内容包括
1 了解数据,定义指标
2 客户分段分析
3 CLV 预测分析 4 客户流失预测分析
5 下一个购买日预测
6 预测销量
7 市场响应模型
8 提升模型
9 A/B 测试设计和执行
本系列主要是围绕前面所讨论的CLV和Customer Segmentaion等进行实践,主要技术涉及数据分析和机器学习等。 本文为第一篇,主要是在探索数据的同时,我们学习和了解一个重要的指标类 North Star Metrics.
一. 了解数据,定义指标
每一位企业领导者都希望为企业带来增长,但是如何做到呢?是需要更多的客户、更多的订单、更高的营收还是更高的效率呢?答案可能是不明确的,可能都需要,但是什么是最重要的呢?每一个企业都需要清楚定位所需解决的核心问题,而在数据分析领域的第一步就是找到我们需要回答的那个正确的问题。
这里我们介绍北极星指标(North Star Metric),一个描述你的产品给客户带来的核心价值的指标,它依赖于具体的产品、目标、行业位置等。例如,Airbnb的北极星指标是每晚的订房数量,而Facebook是日活跃用户。
这里,我们将使用一个在线零售数据集。对于在线零售业,北极星指标为月收入(Monthly Revenue).
月收入 导入包# import libraries from __future__ import division from datetime import datetime, timedelta import pandas as pd import matplotlib.pyplot as plt import numpy as np import seaborn as sns import chart_studio.plotly as py import plotly.offline as pyoff import plotly.graph_objs as go %matplotlib inline读入数据
pyoff.init_notebook_mode() tx_data = pd.read_excel('online Retail.xlsx') tx_data.head(10)
这里使用了一份在线零售数据集,其中的关键信息为
CustomerIDUnitPriceQuantityInvoiceNo
我们在衡量用户价值的同时,首先要先判断一下我们的产品能够给用户提供的核心价值是什么,这需要一个量化指标。基于这个目的,我们介绍North Star Metrics.
看一下Sean Ellis对North Star Metrics的描述:
The North Star Metric is the single metric that best captures the core value that your product delivers to customers.
笔者理解,North Star Metrics 北极星指数是依赖于公司的产品、所处环境、目标等。Airbnb的北极星指数是夜间订阅数,而Facebook是日活跃用户。对于零售数据,我们可以选择月收入作为北极星指标。
月收入=活跃用户数×交易单数×每单平均收入
北极星指标等式为
푅
푒
푣
푒
푛
푢
푒
=
퐴
푐
푡
푖
푣
푒
퐶
푢
푠
푡
표
푚
푒
푟
퐶
표
푢
푛
푡
×
푂
푟
푑
푒
푟
푐
표
푢
푛
푡
×
퐴
푣
푒
푟
푎
푔
푒
푅
푒
푣
푒
푛
푢
푒
푝
푒
푟
표
푟
푑
푒
푟
푅푒푣푒푛푢푒=퐴푐푡푖푣푒 퐶푢푠푡표푚푒푟 퐶표푢푛푡×푂푟푑푒푟 푐표푢푛푡×퐴푣푒푟푎푔푒 푅푒푣푒푛푢푒 푝푒푟 표푟푑푒푟
Revenue=ActiveCustomerCount×Ordercount×AverageRevenueperorder
#converting the type of Invoice Date Field from string to datetime. tx_data['InvoiceDate'] = pd.to_datetime(tx_data['InvoiceDate']) #creating YearMonth field for the ease of reporting and visualization tx_data['InvoiceYearMonth'] = tx_data['InvoiceDate'].map(lambda date: 100*date.year + date.month) #calculate Revenue for each row and create a new dataframe with YearMonth - Revenue columns tx_data['Revenue'] = tx_data['UnitPrice'] * tx_data['Quantity'] tx_revenue = tx_data.groupby(['InvoiceYearMonth'])['Revenue'].sum().reset_index() tx_revenue可视化
#X and Y axis inputs for Plotly graph. We use Scatter for line graphs plot_data = [ go.Scatter( x=tx_revenue['InvoiceYearMonth'], y=tx_revenue['Revenue'], ) ] plot_layout = go.Layout( xaxis={"type": "category"}, title='Montly Revenue' ) fig = go.Figure(data=plot_data, layout=plot_layout) pyoff.iplot(fig)
很明显月收入在8月份开始大幅增长,12月份回落是因为数据不完整。
接下来,我们看一下月收入增长率的情况
#using pct_change() function to see monthly percentage change tx_revenue['MonthlyGrowth'] = tx_revenue['Revenue'].pct_change() #showing first 5 rows tx_revenue.head() #visualization - line graph plot_data = [ go.Scatter( x=tx_revenue.query("InvoiceYearMonth < 201112")['InvoiceYearMonth'], y=tx_revenue.query("InvoiceYearMonth < 201112")['MonthlyGrowth'], ) ] plot_layout = go.Layout( xaxis={"type": "category"}, title='Montly Growth Rate' ) fig = go.Figure(data=plot_data, layout=plot_layout) pyoff.iplot(fig)
可以看到,11月份增长率达到36.5%。但是我们需要详细调查一下四月份,为什么这个月出现大幅下降。
从这里开始,我们聚焦在UK的销售纪录。用CustomerID来唯一确定用户。
#creating a new dataframe with UK customers only tx_uk = tx_data.query("Country=='United Kingdom'").reset_index(drop=True) #creating monthly active customers dataframe by counting unique Customer IDs tx_monthly_active = tx_uk.groupby('InvoiceYearMonth')['CustomerID'].nunique().reset_index() #plotting the output plot_data = [ go.Bar( x=tx_monthly_active['InvoiceYearMonth'], y=tx_monthly_active['CustomerID'], ) ] plot_layout = go.Layout( xaxis={"type": "category"}, title='Monthly Active Customers' ) fig = go.Figure(data=plot_data, layout=plot_layout) pyoff.iplot(fig)
tx_monthly_active
四月份月活用户从三月份的923跌落到817,-11.5%.
在月交易单量的分析中,我们看到了同样趋势。
tx_monthly_sales = tx_uk.groupby('InvoiceYearMonth')['Quantity'].sum().reset_index() tx_monthly_sales
plot_data = [ go.Bar( x=tx_monthly_sales['InvoiceYearMonth'], y=tx_monthly_sales['Quantity'], ) ] plot_layout = go.Layout( xaxis={"type": "category"}, title='Monthly Total # Order' ) fig = go.Figure(data=plot_data, layout=plot_layout) pyoff.iplot(fig)
正如所料,交易单数在四月份也下降了,从279K到257K,-8%.
我们再来看一下单个订单平均收入。
# create a new dataframe for average revenue by taking the mean of it tx_monthly_order_avg = tx_uk.groupby('InvoiceYearMonth')['Revenue'].mean().reset_index() #plot the bar chart plot_data = [ go.Bar( x=tx_monthly_order_avg['InvoiceYearMonth'], y=tx_monthly_order_avg['Revenue'], ) ] plot_layout = go.Layout( xaxis={"type": "category"}, title='Monthly Order Average' ) fig = go.Figure(data=plot_data, layout=plot_layout) pyoff.iplot(fig)
#print the dataframe tx_monthly_order_avg
四月份也下降了,从16.7到15.8.
到现在为止,我们已经观察了一些重要的指标。
行业上还有其他一些相对重要的指标,下面我们来调查一下:
新客户比例留存率 新客户比例
首先我们需要定义一下什么是新客户。所谓新客户就是在当月是其第一次消费的客户。
#create a dataframe contaning CustomerID and first purchase date tx_min_purchase = tx_uk.groupby('CustomerID').InvoiceDate.min().reset_index() tx_min_purchase.columns = ['CustomerID','MinPurchaseDate'] tx_min_purchase['MinPurchaseYearMonth'] = tx_min_purchase['MinPurchaseDate'].map(lambda date: 100*date.year + date.month) #merge first purchase date column to our main dataframe (tx_uk) tx_uk = pd.merge(tx_uk, tx_min_purchase, on='CustomerID') tx_uk.head() #create a column called User Type and assign Existing #if User's First Purchase Year Month before the selected Invoice Year Month tx_uk['UserType'] = 'New' tx_uk.loc[tx_uk['InvoiceYearMonth']>tx_uk['MinPurchaseYearMonth'],'UserType'] = 'Existing' #calculate the Revenue per month for each user type tx_user_type_revenue = tx_uk.groupby(['InvoiceYearMonth','UserType'])['Revenue'].sum().reset_index() #filtering the dates and plot the result tx_user_type_revenue = tx_user_type_revenue.query("InvoiceYearMonth != 201012 and InvoiceYearMonth != 201112") plot_data = [ go.Scatter( x=tx_user_type_revenue.query("UserType == 'Existing'")['InvoiceYearMonth'], y=tx_user_type_revenue.query("UserType == 'Existing'")['Revenue'], name = 'Existing' ), go.Scatter( x=tx_user_type_revenue.query("UserType == 'New'")['InvoiceYearMonth'], y=tx_user_type_revenue.query("UserType == 'New'")['Revenue'], name = 'New' ) ] plot_layout = go.Layout( xaxis={"type": "category"}, title='New vs Existing' ) fig = go.Figure(data=plot_data, layout=plot_layout) pyoff.iplot(fig)
老客户有上升趋势,而新客户有一点下降趋势。
看一下新客户比例
#create a dataframe that shows new user ratio - we also need to drop NA values (first month new user ratio is 0) tx_user_ratio = tx_uk.query("UserType == 'New'").groupby(['InvoiceYearMonth'])['CustomerID'].nunique()/tx_uk.query("UserType == 'Existing'").groupby(['InvoiceYearMonth'])['CustomerID'].nunique() tx_user_ratio = tx_user_ratio.reset_index() tx_user_ratio = tx_user_ratio.dropna() #print the dafaframe tx_user_ratio #plot the result plot_data = [ go.Bar( x=tx_user_ratio.query("InvoiceYearMonth>201101 and InvoiceYearMonth<201112")['InvoiceYearMonth'], y=tx_user_ratio.query("InvoiceYearMonth>201101 and InvoiceYearMonth<201112")['CustomerID'], ) ] plot_layout = go.Layout( xaxis={"type": "category"}, title='New Customer Ratio' ) fig = go.Figure(data=plot_data, layout=plot_layout) pyoff.iplot(fig)
如同所料,新客户比例在下降。到了11月份下降到了20%。
我们需要仔细观察分析这个指标,因为它说明我们提供的商品和服务对客户的粘性以及市场适合情况。
푀표푛푡ℎ푙푦 푅푒푡푒푛푡푖표푛 푅푎푡푖표=上个月的留存客户/总的活跃用户
#identify which users are active by looking at their revenue per month tx_user_purchase = tx_uk.groupby(['CustomerID','InvoiceYearMonth'])['Revenue'].sum().reset_index() #create retention matrix with crosstab tx_retention = pd.crosstab(tx_user_purchase['CustomerID'], tx_user_purchase['InvoiceYearMonth']).reset_index() #create an array of dictionary which keeps Retained & Total User count for each month months = tx_retention.columns[2:] retention_array = [] for i in range(len(months)-1): retention_data = {} selected_month = months[i+1] prev_month = months[i] retention_data['InvoiceYearMonth'] = int(selected_month) retention_data['TotalUserCount'] = tx_retention[selected_month].sum() retention_data['RetainedUserCount'] = tx_retention[(tx_retention[selected_month]>0) & (tx_retention[prev_month]>0)][selected_month].sum() retention_array.append(retention_data) #convert the array to dataframe and calculate Retention Rate tx_retention = pd.Dataframe(retention_array) tx_retention['RetentionRate'] = tx_retention['RetainedUserCount']/tx_retention['TotalUserCount'] #plot the retention rate graph plot_data = [ go.Scatter( x=tx_retention.query("InvoiceYearMonth<201112")['InvoiceYearMonth'], y=tx_retention.query("InvoiceYearMonth<201112")['RetentionRate'], name="organic" ) ] plot_layout = go.Layout( xaxis={"type": "category"}, title='Monthly Retention Rate' ) fig = go.Figure(data=plot_data, layout=plot_layout) pyoff.iplot(fig)
tx_retention.head()
留存率在六月和八月到达高峰,然后又下降了。
另外一个较为重要的衡量留存的指标是Cohort based retention ratio, 笔者将其译为同批留存率。这里的Cohort定义为第一次购买发生在相同年月的客户,称为同批客户。同批留存率衡量每个月的同批客户的留存率,通过这个指标,我们可以了解老客户和新客户留存的不同,还有是否最近的改变对客户体验有所影响。
#create our retention table again with crosstab() and add firs purchase year month view tx_retention = pd.crosstab(tx_user_purchase['CustomerID'], tx_user_purchase['InvoiceYearMonth']).reset_index() tx_retention = pd.merge(tx_retention,tx_min_purchase[['CustomerID','MinPurchaseYearMonth']],on='CustomerID') new_column_names = [ 'm_' + str(column) for column in tx_retention.columns[:-1]] new_column_names.append('MinPurchaseYearMonth') tx_retention.columns = new_column_names #create the array of Retained users for each cohort monthly retention_array = [] for i in range(len(months)): retention_data = {} selected_month = months[i] prev_months = months[:i] next_months = months[i+1:] for prev_month in prev_months: retention_data[prev_month] = np.nan total_user_count = tx_retention[tx_retention.MinPurchaseYearMonth == selected_month].MinPurchaseYearMonth.count() retention_data['TotalUserCount'] = total_user_count retention_data[selected_month] = 1 query = "MinPurchaseYearMonth == {}".format(selected_month) for next_month in next_months: new_query = query + " and {} > 0".format(str('m_' + str(next_month))) retention_data[next_month] = np.round(tx_retention.query(new_query)['m_' + str(next_month)].sum()/total_user_count,2) retention_array.append(retention_data) tx_retention = pd.Dataframe(retention_array) tx_retention.index = months #showing new cohort based retention table tx_retention
从数据上看,第一月的那批客户的留存率是最高的。
未完,待续…
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)