Discrete distributions present us with a problem when calculating the quantile: we start from a continuous real-valued variable, the probability, but the result (the value of the random variable) should really be discrete.
Consider a binomial distribution with a sample size of 50 and a success fraction of 0.5. There are various ways we could plot a discrete distribution, but if we plot the PDF as a step function, the graph looks something like this:
Now suppose the user asks for the quantile at a probability of 0.05. If we zoom in on that region of the CDF, we see the following:
As you can see, there is no value of the random variable whose probability corresponds exactly to 0.05, so, as the graph shows, we have two choices:
We could round the result down to 18.
We could round the result up to 19.
In fact there is a third option: we could "pretend" that the distribution is continuous and return a real-valued result: in this case we would calculate a result of approximately 18.701 (which neatly reflects the fact that the true answer is closer to 19 than to 18).
Any of these options can be selected by using policies, but that still leaves the question: what should we actually do?
More specifically: which of these should be the default?
Before answering that, it helps to realise that:
Calculating an integer result is usually much faster than calculating a real-valued one: around 20 times faster in our tests.
People usually calculate a quantile in order to perform some kind of test, along the lines of: "if the value of the random variable is less than N, we can reject our null hypothesis with 90% confidence."
So calculating an integer result is a big win, and philosophically it is also the "right thing to do". Furthermore, if someone asks for the quantile at a probability of 0.05, we can normally assume that they want at least 95% of the probability to the right of the value chosen and no more than 5% of the probability to its left.
In the example above, we should therefore round the result down to 18.
The converse applies to upper quantiles: if the probability is greater than 0.5, we want to round the result up, so that at least the requested probability lies to the left of the returned value and no more than 1 minus the requested probability lies to its right.
Likewise, for two-sided intervals, we round the lower quantile down and the upper quantile up. This ensures that we have at least the requested probability in the central region and no more than 1 minus the requested probability in the tails.
For example, taking our binomial distribution with a sample size of 50 and a success fraction of 0.5, if we want a two-sided 90% confidence interval, we ask for the quantiles at 0.05 and 0.95 with the results rounded outwards, so that at least 90% of the probability lies in the central region.
So far so good, but there is in fact a trap here for the unwary:
quantile(binomial(50, 0.5), 0.05);
returns 18 as the result, which is just what we would expect from the graph above; note, however, that there is in fact no value of x greater than 18 for which:
cdf(binomial(50, 0.5), x) <= 0.05;
On the other hand:
quantile(binomial(50, 0.5), 0.95);
returns 31, even though there is no value of x less than 31 for which:
cdf(binomial(50, 0.5), x) >= 0.95;
We might naively have hoped that for this symmetrical distribution the result would be 32 (since 32 = 50 - 18), but we have to remember that the CDF of the binomial distribution is inclusive of the random variable. So while the lower tail includes the lower quantile returned, the upper tail excludes the upper quantile value: it belongs to the central region.
Look at the graph above to see what is going on here: the lower quantile, 18, is part of the lower tail, so any value <= 18 lies in the lower tail. The upper quantile, 31, on the other hand, is part of the central region: the upper tail actually begins at 32, so any value > 31 lies in the upper tail.
So if U and L are the upper and lower quantiles respectively, then a random variable X lies in the tails, and we reject the null hypothesis, if:
X <= L || X > U
whereas if:
L < X <= U
then X lies in the central region.
The moral here is to always be careful what you are doing when dealing with discrete distributions; when in doubt, base your comparisons on the CDF.
Other rounding policies
As you might expect after the section on policies, it will come as no surprise that other rounding policies are available:
integer_round_outwards
This is the default policy, as described above: lower quantiles (where the probability is less than 0.5) are rounded down, and upper quantiles (where the probability is greater than 0.5) are rounded up.
This gives no more than the requested probability in the tails and at least the requested probability in the central region.
integer_round_inwards
This policy is the exact opposite of the default: lower quantiles (probability less than 0.5) are rounded up, and upper quantiles (probability greater than 0.5) are rounded down.
This gives at least the requested probability in the tails and no more than the requested probability in the central region.
integer_round_down
This policy rounds the result down regardless of whether it is an upper or a lower quantile.
integer_round_up
This policy rounds the result up regardless of whether it is an upper or a lower quantile.
integer_round_nearest
This policy rounds the result to the nearest integer regardless of whether it is an upper or a lower quantile.
real
This policy returns a real-valued result from the quantile of a discrete distribution: this is generally slower than finding an integer result but permits more flexible rounding strategies.
To see how the rounding policies are applied to a discrete distribution, we will take a binomial distribution with a sample size of 50 and a success fraction of 0.5 as an example, and compute all the possible quantiles at 0.05 and 0.95.
First we include some required headers:
#include <iostream>
#include <iomanip>
#include <boost/math/distributions/binomial.hpp>
Next we bring the declarations we need into scope, and define distribution types for each of the available rounding policies:
using namespace boost::math::policies;
using namespace boost::math;
typedef binomial_distribution<
      double,
      policy<discrete_quantile<integer_round_outwards> > >
   binom_round_outwards;
typedef binomial_distribution<
      double,
      policy<discrete_quantile<integer_round_inwards> > >
   binom_round_inwards;
typedef binomial_distribution<
      double,
      policy<discrete_quantile<integer_round_down> > >
   binom_round_down;
typedef binomial_distribution<
      double,
      policy<discrete_quantile<integer_round_up> > >
   binom_round_up;
typedef binomial_distribution<
      double,
      policy<discrete_quantile<integer_round_nearest> > >
   binom_round_nearest;
typedef binomial_distribution<
      double,
      policy<discrete_quantile<real> > >
   binom_real_quantile;
Now let's call the quantile function on each of these distributions:
int main()
{
   std::cout <<
      "Testing rounding policies for a 50 sample binomial distribution,\n"
      "with a success fraction of 0.5\n\n"
      "Lower quantiles are calculated at p = 0.05\n\n"
      "Upper quantiles at p = 0.95\n\n";
   std::cout << std::setw(25) << std::right
      << "Policy" << std::setw(18) << std::right
      << "Lower Quantile" << std::setw(18) << std::right
      << "Upper Quantile" << std::endl;
   // Test integer_round_outwards:
   std::cout << std::setw(25) << std::right
      << "integer_round_outwards"
      << std::setw(18) << std::right
      << quantile(binom_round_outwards(50, 0.5), 0.05)
      << std::setw(18) << std::right
      << quantile(binom_round_outwards(50, 0.5), 0.95)
      << std::endl;
   // Test integer_round_inwards:
   std::cout << std::setw(25) << std::right
      << "integer_round_inwards"
      << std::setw(18) << std::right
      << quantile(binom_round_inwards(50, 0.5), 0.05)
      << std::setw(18) << std::right
      << quantile(binom_round_inwards(50, 0.5), 0.95)
      << std::endl;
   // Test integer_round_down:
   std::cout << std::setw(25) << std::right
      << "integer_round_down"
      << std::setw(18) << std::right
      << quantile(binom_round_down(50, 0.5), 0.05)
      << std::setw(18) << std::right
      << quantile(binom_round_down(50, 0.5), 0.95)
      << std::endl;
   // Test integer_round_up:
   std::cout << std::setw(25) << std::right
      << "integer_round_up"
      << std::setw(18) << std::right
      << quantile(binom_round_up(50, 0.5), 0.05)
      << std::setw(18) << std::right
      << quantile(binom_round_up(50, 0.5), 0.95)
      << std::endl;
   // Test integer_round_nearest:
   std::cout << std::setw(25) << std::right
      << "integer_round_nearest"
      << std::setw(18) << std::right
      << quantile(binom_round_nearest(50, 0.5), 0.05)
      << std::setw(18) << std::right
      << quantile(binom_round_nearest(50, 0.5), 0.95)
      << std::endl;
   // Test real:
   std::cout << std::setw(25) << std::right
      << "real"
      << std::setw(18) << std::right
      << quantile(binom_real_quantile(50, 0.5), 0.05)
      << std::setw(18) << std::right
      << quantile(binom_real_quantile(50, 0.5), 0.95)
      << std::endl;
}
The program produces the following output:
Testing rounding policies for a 50 sample binomial distribution,
with a success fraction of 0.5

Lower quantiles are calculated at p = 0.05

Upper quantiles at p = 0.95
Policy Lower Quantile Upper Quantile
integer_round_outwards 18 31
integer_round_inwards 19 30
integer_round_down 18 30
integer_round_up 19 31
integer_round_nearest 19 30
real 18.701 30.299
Copyright 2006, 2007, 2008 John Maddock, Paul A. Bristow, Hubert Holin, Xiaogang Zhang and Bruno Lalande.
Distributed under the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
Learn about probability distributions while analyzing bike-sharing data.
Find the probability of there being more than 5000 riders in a single day (using the cnt column).
Assign the result to prob_over_5000.
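A sketch of how this might look, assuming the data has been loaded into a pandas DataFrame called bike_rentals (the variable and file names here are hypothetical):

import pandas as pd

# Hypothetical: daily bike-sharing data with a "cnt" column of ride counts
bike_rentals = pd.read_csv("bike_rental_day.csv")
# Fraction of days with more than 5000 riders
prob_over_5000 = (bike_rentals["cnt"] > 5000).mean()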
We know that the probability is about .39 that there are more than 5000 riders in a single day.
In an earlier mission, we figured out how to find the probability of k outcomes out of N events occurring. The formula was:
(p^k * q^(N-k)) * N! / (k! * (N-k)!)
where q = 1 - p, the probability of the outcome not occurring.
Create a function that can compute the probability of k outcomes out of N events occurring.
Use the function to find the probability of each number of outcomes in outcome_counts occurring.
An outcome is a day where there are more than 5000 riders, with p = .39.
You should have a list with 31 items, where the first item is the probability of 0 days out of 30 with more than 5000 riders, the second is the probability of 1 day out of 30, and so on, up to 30 days out of 30.
Assign the list to outcome_probs.
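A minimal sketch of such a function, using the formula above (N = 30 and p = .39 are taken from the earlier screens):

import math

def binomial_prob(k, N, p):
    # (p^k * q^(N-k)) * N! / (k! * (N-k)!), with q = 1 - p
    q = 1 - p
    return (p ** k) * (q ** (N - k)) * math.factorial(N) / (math.factorial(k) * math.factorial(N - k))

outcome_counts = list(range(31))  # 0 through 30 days, so 31 items
outcome_probs = [binomial_prob(k, 30, 0.39) for k in outcome_counts]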
You may have noticed that outcome_counts in the previous screen was 31 items long when N was only 30. This is because we need to account for 0. There's a chance of having k=0, where the outcome we want doesn't happen at all. This data point needs to be on our charts. We'll always want to add 1 to N when figuring out how many points to plot.
We can instead use the binom.pmf function from SciPy to do this faster.
Here's a usage example:
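A minimal sketch, assuming scipy.stats.binom and the values from the earlier screens (N = 30, p = .39):

import numpy as np
from scipy.stats import binom

outcome_counts = np.arange(31)  # k = 0 through 30
# Probability of each k occurring; binom.pmf accepts an array of outcomes
outcome_probs = binom.pmf(outcome_counts, 30, 0.39)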
The pmf function in SciPy is an implementation of the mathematical probability mass function. The pmf will give us the probability of each k in our outcome_counts list occurring.
A binomial distribution only needs two parameters. A parameter is the statistical term for a number that summarizes data for the entire population. For a binomial distribution, the parameters are N, the total number of events, and p, the probability of the outcome we're interested in seeing.
The SciPy function pmf matches this and takes in the following parameters:
x: the list of outcomes,
n: the total number of events,
p: the probability of the outcome we're interested in seeing
Because we only need two parameters to describe a distribution, it doesn't matter whether we want to know if it will be sunny 5 days out of 5, or if 5 out of 5 coin flips will turn up heads. As long as the outcome that we care about has the same probability (p), and N is the same, the binomial distribution will look the same.
If we repeatedly look at 30 days of bike-sharing data, we'll find that 10 of the days had more than 5000 riders about 12.4% of the time. We'll find that 12 of the days had more than 5000 riders about 14.6% of the time.
Sometimes we'll want to be able to tell people the expected value of a probability distribution -- the most likely result of a single sample that we look at.
To compute this, we just multiply N by p.
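For the bike-sharing example, with the assumed values N = 30 and p = .39:

# Expected value of a binomial distribution: N * p
expected_value = 30 * 0.39  # 11.7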
Just as we can compute the mean, we can also compute the standard deviation of a probability distribution. This helps us find how much the actual values will vary from the mean when we take a sample.
Going back to the bike-sharing example, we know that the actual values will be around 11.7 (from the last screen). But we'll need a standard deviation to explain how much the actual values can vary from this expectation.
The formula for the standard deviation of a binomial probability distribution is sqrt(N * p * q), where q = 1 - p.
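A quick sketch with the same assumed values (N = 30, p = .39, so q = .61):

import math

# Standard deviation of a binomial distribution: sqrt(N * p * q)
std_dev = math.sqrt(30 * 0.39 * 0.61)  # about 2.67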
Just like we did with histograms and sampling a few missions ago, we can vary the parameters to change the distribution. Let's see what the plot would look like with only 10 events, or 100 events.
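One way the two plots might be produced (a sketch; matplotlib and the p value from earlier are assumed):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

for n in (10, 100):
    k = np.arange(n + 1)  # include k = 0, hence n + 1 points
    plt.bar(k, binom.pmf(k, n, 0.39))
    plt.title("Binomial distribution with N = {}".format(n))
    plt.show()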
From the last screen, the more events we looked at, the closer our distribution was to being normal. With N=10, we saw some rightward skew, but when we got up to N=100, the skew disappeared.
This is because the distribution got narrower relative to the x-axis range the more examples you add. With N=10, there's a reasonable chance that 8 to 10 days could have over 5000 riders. But when we get up to N=100, it's statistically almost impossible that more than 60 days have over 5000 riders. This makes the distribution narrower.
As the distribution gets narrower, it gets more similar to the normal distribution. In the code cell, we plot a line chart instead of a bar chart and it looks almost exactly like a normal distribution.
So far, we've looked at the probability that single values of k will occur. What we can look at instead is the probability that k or less will occur. These probabilities can be generated by the cumulative distribution function.
Let's say we flip a coin 3 times -- N=3, and p=.5. When this happens, here are the probabilities:
k=0: .125, k=1: .375, k=2: .375, k=3: .125
A cumulative distribution would look like this:
k=0: .125, k=1: .5, k=2: .875, k=3: 1
For each k, we fill in the probability that we'll see k outcomes or less. By the end of the distribution, we should get 1, because all the probabilities add to 1 (if we flip 3 coins, either 0, 1, 2, or 3 of them must be heads).
We can calculate this with binom.cdf in SciPy.
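For the three-coin-flip example above, a minimal sketch:

from scipy.stats import binom

# Probability of 2 or fewer heads in 3 flips of a fair coin
print(binom.cdf(2, 3, 0.5))  # 0.875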
This is because every normal distribution, as we learned in an earlier mission, has the same properties when it comes to what percentage of the data is within a certain number of standard deviations of the mean. You can look these up in a standard normal table. About 68% of the data is within 1 standard deviation of the mean, 95% is within 2, and 99% is within 3.
We can calculate the mean (μ) and standard deviation (σ) of a binomial probability distribution using the formulas from earlier: μ = N * p and σ = sqrt(N * p * q).
If we want to figure out the number of standard deviations from the mean a value is, we just do:
(k - μ) / σ
If we wanted to find how many standard deviations from the mean 16 days is:
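Using the assumed values from earlier (N = 30, p = .39):

mean = 30 * 0.39                     # 11.7
std_dev = (30 * 0.39 * 0.61) ** 0.5  # about 2.67
z = (16 - mean) / std_dev            # about 1.61 standard deviations above the mean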
This tells us that 16 days is approximately 1.61 standard deviations above the mean. In percentage form, this captures about 44.63% of our data. If we also include 1.61 standard deviations below the mean (both sides of the distribution), this'll include 89.26% of our data.
There's a 5.37% chance that a value is 1.61 standard deviations or more above the mean (to the right of the mean), and there's a 5.37% chance that a value is 1.61 standard deviations or more below the mean (to the left).
We don't want to have to use a z-score table every time we want to see how likely or unlikely a probability is. A much faster way is to use the cumulative distribution function (cdf) like we did earlier. This won't give us exactly the same values as using z-scores, because the distribution isn't exactly normal, but it will give us the actual amount of probability in a distribution to the left of a given k.
To use it, we can run:
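A sketch, again assuming the bike-sharing values (N = 30, p = .39) and k = 16:

from scipy.stats import binom

left = binom.cdf(16, 30, 0.39)  # probability of 16 or fewer days
right = 1 - left                # probability of strictly more than 16 days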
This will return the sum of all the probabilities to the left of and including k. If we subtract this value from 1, we get the sum of all the probabilities to the right of k.