What is bootstrapped data in data mining?
Recently I came across this term, but I really have no idea what it refers to. I've searched online, but with little gain. Thanks.
Take a sample of the time of day that you wake up on Saturdays. Some Friday nights you have a few too many drinks, so you wake up early (but go back to bed). Other days you wake up at a normal time. Other days you sleep in.
Here are the results:
[3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]
What is the mean time that you wake up?
Well it's 6.8 (o'clock, or 6:48). A touch early for me.
How good a prediction is this of when you'll wake up next Saturday? Can you quantify how wrong you are likely to be?
It's a pretty small sample, and we're not sure of the distribution of the underlying process, so it might not be a good idea to use standard parametric statistical techniques†.
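Why don't we take a random sample of our sample (the same size as the original, drawn with replacement, so some values appear more than once and others not at all), calculate the mean, and repeat this? The spread of those means will give us an estimate of how bad our estimate is.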
I did this several times, and the mean was between 5.98 and 7.8.
This is called the bootstrap, and it was introduced by Bradley Efron in 1979.
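The resampling described above can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name `bootstrap_means`, the number of resamples, and the seed are my own choices, not something from the original experiment, so the printed range will not match 5.98 to 7.8 exactly.

```python
import random
import statistics

# Wake-up times from the sample above (decimal hours).
wake_times = [3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]

def bootstrap_means(data, n_resamples=1000, seed=None):
    """Return the mean of each bootstrap resample of `data`.

    `n_resamples` and `seed` are illustrative parameters of this sketch.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Draw len(data) values from the original sample, with replacement.
        resample = rng.choices(data, k=len(data))
        means.append(statistics.mean(resample))
    return means

means = bootstrap_means(wake_times, n_resamples=1000, seed=0)
print(f"sample mean: {statistics.mean(wake_times):.2f}")
print(f"bootstrap means range from {min(means):.2f} to {max(means):.2f}")
```

With only a handful of resamples the range is narrow; with more resamples it widens, which is why people usually report percentiles of the bootstrap means rather than the minimum and maximum.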
A related technique is called the jackknife, where you leave one observation out of your dataset, take the mean of the rest, and repeat for each observation in turn. The jackknife mean is 6.8 (same as the arithmetic mean) and ranges from 6.4 to 7.2.
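The jackknife figures can be checked with a short sketch. The helper name `jackknife_means` is hypothetical; the data are the wake-up times listed above.

```python
import statistics

# Wake-up times from the sample above (decimal hours).
wake_times = [3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]

def jackknife_means(data):
    """Return the leave-one-out means: one mean per omitted observation."""
    return [
        statistics.mean(data[:i] + data[i + 1:])
        for i in range(len(data))
    ]

loo_means = jackknife_means(wake_times)
print(f"jackknife mean: {statistics.mean(loo_means):.1f}")          # 6.8
print(f"range: {min(loo_means):.1f} to {max(loo_means):.1f}")       # 6.4 to 7.2
```

Unlike the bootstrap, nothing here is random: there are exactly ten leave-one-out means, so the result is the same on every run.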
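Another related technique is called k-fold cross-validation, where you split your data set (at random) into k equally sized sections, calculate the mean of all but one section, and repeat k times, leaving out a different section each time. The 5-fold cross-validation mean is 6.8. Splitting the sorted data in order, the means of the individual sections range from 4 to 9, while the means of the remaining eight values range from about 6.2 to 7.5.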
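Here is a minimal sketch of the k-fold procedure as described: shuffle, cut into k equal sections, and take the mean of everything outside each section. The function name `kfold_means` and the `seed` parameter are assumptions for illustration, and the sketch only handles the case where the data divide evenly into k sections.

```python
import random
import statistics

# Wake-up times from the sample above (decimal hours).
wake_times = [3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]

def kfold_means(data, k, seed=None):
    """Split `data` at random into k equal sections and return, for each
    section, the mean of all the values outside it."""
    if len(data) % k != 0:
        raise ValueError("this sketch assumes len(data) is divisible by k")
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    fold_size = len(shuffled) // k
    means = []
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        # Keep everything except the i-th section.
        kept = shuffled[:start] + shuffled[stop:]
        means.append(statistics.mean(kept))
    return means

fold_means = kfold_means(wake_times, k=5, seed=0)
print(f"5-fold mean: {statistics.mean(fold_means):.1f}")
print(f"range: {min(fold_means):.2f} to {max(fold_means):.2f}")
```

Because every value is left out exactly once, the average of the five means always equals the plain sample mean; only the range depends on how the random split falls.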
† This distribution does happen to be Normal. The 95% confidence interval of the mean is 5.43 to 8.11, reasonably close to, but wider than, the range of the bootstrap means.
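For reference, the parametric interval in the footnote can be reproduced as follows. This is a sketch using only the standard library, so the Student's t critical value for 9 degrees of freedom (2.262) is hard-coded from a t-table rather than computed; that constant and the function name are my assumptions.

```python
import math
import statistics

# Wake-up times from the sample above (decimal hours).
wake_times = [3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]

# Two-sided 95% critical value of Student's t with n - 1 = 9 degrees of
# freedom, taken from a standard t-table (assumption of this sketch).
T_CRIT_95_DF9 = 2.262

def t_confidence_interval(data, t_crit):
    """Return (low, high) for the mean, using the sample standard deviation."""
    mean = statistics.mean(data)
    # Standard error of the mean: s / sqrt(n), with s using n - 1.
    sem = statistics.stdev(data) / math.sqrt(len(data))
    return mean - t_crit * sem, mean + t_crit * sem

low, high = t_confidence_interval(wake_times, T_CRIT_95_DF9)
print(f"95% CI for the mean: {low:.2f} to {high:.2f}")  # 5.43 to 8.11
```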
If you don't have enough data to train your algorithm, you can increase the size of your training set by selecting items uniformly at random (with replacement) and duplicating them.
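A rough sketch of that idea in Python, assuming the training set is a plain list; the function name `oversample` and its parameters are hypothetical, not part of any library.

```python
import random

def oversample(items, target_size, seed=None):
    """Grow `items` to `target_size` by appending duplicates chosen
    uniformly at random, with replacement. The originals are kept as-is."""
    if target_size < len(items):
        raise ValueError("target_size must be at least len(items)")
    rng = random.Random(seed)
    # The same item may be picked more than once.
    extra = rng.choices(items, k=target_size - len(items))
    return list(items) + extra

training_set = ["a", "b", "c"]
bigger = oversample(training_set, target_size=10, seed=0)
print(len(bigger))  # 10
```

Note that the duplicates add no new information; the enlarged set contains exactly the same distinct items as before.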
In machine learning, bootstrapping is iterative training on a known set. http://en.wikipedia.org/wiki/Bootstrapping_(machine_learning)