So here are the kernels, for the US women's heights. So there's lots of US women in this sample.
So, I could work with quite small bins. And it looks fine. It's interesting. We can say something about, maybe, the mode of the distribution is about 160 centimeters. But, after a while, the little bumps here become somewhat annoying. They are clearly not representing what's there in reality. So we might want something smoother. And one way to get something smoother is, instead of having a histogram, to represent the data with what's called a "kernel." So let me show you these data. Let me show you the same data. I've plotted now the histogram, but without color, so we can see the kernel. And the kernel is this function. So the kernel density estimation is, in a sense, a smoothed histogram. So how does this smoothing happen? Remember that we said, in order to do the histogram, I take a particular interval, and I basically stack up a little vertical block for each observation, so the height is proportional to the number of observations that I have. For the kernel density estimation, we do the same thing, except that, instead of stacking little rectangular bars, we sum up the result from a kernel function. So let me first show it to you graphically, and then I'll put down the formula. So suppose that this is my sample. I have a sample which has only 10 observations. And these are my observations. These are the values of the observations. How do I do a kernel density estimation? Around each observation, I'm going to draw a curve that we are going to call a "kernel," which is where the name "kernel density estimation" comes from-- this kernel. What's a kernel? What do you think of these blue curves? What do they look like? A normal distribution? It could be a normal, or it could be-- it doesn't even have to be a normal. What's relevant-- Oh, Gaussian. Gaussian and normal are friends. That's the same thing. But what's-- yeah. They're symmetrical and centered on the point. Exactly. What they need to be is symmetrical and centered on the point. So a Gaussian is fine. A normal is fine.
This one is a normal, as it tells you on the top. But actually, a little inverted U-shape on top of the point would work just fine. So any distribution that is symmetrical and centered around the point will do. And it has to integrate to 1. So we draw all these curves. We draw all these bells, normal-looking bells, or Epanechnikov. Epanechnikov is kind of more of a rounder bell, like that. It doesn't really matter. The size of the bell, the shape of the bell, that's a choice, but it's not a choice that turns out to be deeply important. We do that for each of the points. OK? And then we take a bin. In the case of kernel density estimation, we are going to call that a "bandwidth"-- like the width of the band. And then, suppose I'm interested in estimating the kernel density function at this guy, this point. I'm drawing my band. Here, in this case, we know that the bandwidth is 0.678. So I'm drawing a little band of 0.678-- so, if this is 1, that's about that. I'm drawing it around the point where I'm interested in estimating. I'm drawing it around my x. And then I'm summing up the heights of all the curves for the points that fall within this band. So, for example, here, when I draw this, I'm getting this one, this one, this one, this one, this one-- roughly these ones. So I'm going to sum up, at the point x, the height of each of these curves. So it's very similar to doing a histogram, except that, in a histogram, at a point I stack rectangles of the same height, and here I'm going to stack little bars of different heights, giving them smaller height if they are far from my point and larger height if they are close to my point. Does that make sense? So, if you look at the very edge of it, for this point here there is almost only the contribution of the first kernel, so the estimate is very close to that first curve.
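The stacking-of-kernels idea can be sketched in a few lines of Python. The sample values and the bandwidth below are invented for illustration; they are not the lecture's data:

```python
import math

def gaussian_kernel(u):
    # Standard normal density: symmetric, centered at zero, integrates to 1.
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde_at(x, sample, h):
    # At the point x, stack one little bar per observation: the height of
    # that observation's kernel at x. Observations far from x contribute
    # almost nothing; observations close to x contribute a lot.
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (len(sample) * h)

sample = [150, 152, 155, 158, 160, 161, 163, 165, 170, 175]  # 10 heights, in cm
h = 4.0  # hypothetical bandwidth

# The estimate is high where observations cluster, low in the tails.
print(kde_at(160, sample, h), kde_at(190, sample, h))
```

Swapping `gaussian_kernel` for an Epanechnikov kernel (`0.75 * (1 - u * u)` for `|u| < 1`, zero otherwise) changes the result very little, which is the point made above: the shape of the bell is not a choice that matters much.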
And then I'm kind of moving up from it, because, for all of these points that are here, I'm adding a lot of kernels, so the vertical height is higher. At this point here, which corresponds to a point above here on the histogram, you can see that there are a lot of kernels to add. Make sense? Yep. So, for a certain bandwidth, when you sum up all the heights of the little curves in that bandwidth, then that final height value, does it get plotted at the beginning of that bandwidth? In the middle. Oh. In the middle. So, basically, you draw the bandwidth. If you want to plot this particular point, you draw the bandwidth centered around that point, you sum up all of the kernels that show up in that interval, and that gives you the value at that point. OK? Now, concretely, you don't actually do that by hand-- R does that. So that's what this function tells us. I think it's useful to go from the graphical representation to what this function tells us. Basically it tells us: if you have a sample x1, x2, up to xn, an independent and identically distributed sample drawn from some distribution with an unknown PDF that you are trying to get some sense of, then the kernel density estimator, at any point x, is the sum over the observations of the kernel function evaluated at (x minus xi) divided by the bandwidth. So basically it's this weighted sum of all of the kernel functions. So it gives us something which is quite similar to a histogram, but within each bin it gives more weight in our counts to the observations that are closer to the center of the interval. And we divide by n, and by the size of the bandwidth. Yep? So how-- I guess, how accurate is it to assume that it's identically distributed? And, like, what can you do about a sample that's not identically distributed? So, in this particular case, this is what it is.
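Written out, the estimator being described verbally is the standard kernel density estimator: for an iid sample $x_1, \dots, x_n$, a bandwidth $h$, and a kernel $K$,

```latex
\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)
```

Dividing by $n$ averages the kernel contributions, and dividing by $h$ rescales them so that $\hat{f}_h$ still integrates to 1, since each kernel $K$ does.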
That's the assumption we are making.
Whether it's a good assumption or a bad assumption is going to depend on the data set. But it's the assumption we are making for it to even make sense to start drawing a kernel density estimate. So, typically, for a sample of observations representing heights, it's pretty reasonable to think that it's an iid sample. It might be an iid sample from a funky distribution. For example, if I have men and women, there is a distribution that represents the heights of men and women together, but I could instead say, well, this distribution is really the combination of two distributions: one distribution for the men, one distribution for the women. But there is still a single distribution that this sample is coming from. Can you weight kernels differently, in a way, to distribute your-- like, find a probability [INAUDIBLE]? No, you cannot. Because, remember, you have no idea, at this point, what the distribution is. What you're trying to do here is to say: this is my sample. I'm assuming that it's an iid sample drawn from some distribution, and I want to look at the shape of this distribution. I make no assumption-- that's the value of a kernel-- I make no assumption about what the distribution looks like. So you can see that, here, there is actually a bump here. This definitely doesn't look, for example, like a normal distribution. I make zero assumptions about what the distribution might be. The kernel is going to tell me what it might be.
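The men-and-women point can be illustrated with a small simulation. The group means and spreads below are invented for the sketch; the idea is that the kernel estimate shows two bumps without ever being told there are two groups:

```python
import math
import random

def kde_at(x, sample, h):
    # Gaussian-kernel density estimate at x, as before.
    return sum(
        math.exp(-0.5 * ((x - xi) / h) ** 2) / math.sqrt(2 * math.pi)
        for xi in sample
    ) / (len(sample) * h)

random.seed(0)
# Hypothetical mixture: 500 "women" around 162 cm, 500 "men" around 176 cm.
sample = ([random.gauss(162, 4) for _ in range(500)] +
          [random.gauss(176, 5) for _ in range(500)])

h = 2.0  # hypothetical bandwidth
# The estimate has a bump near each group mean and a dip in between,
# even though we assumed nothing about the shape of the distribution.
for x in (162, 169, 176):
    print(x, round(kde_at(x, sample, h), 4))
```

A parametric fit of a single normal would smooth that dip away; the kernel estimate, making no shape assumption, lets the two subpopulations show up on their own.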