Machine Learning with R Tutorial: How kmeans() works and practical matters

Make sure to like & comment if you enjoy this video! This is the third video for our course Unsupervised Learning in R by Hank Roark. Take Hank's course here: https://www.datacamp.com/courses/unsu...

In this section I am going to help build intuition about how ‘kmeans’ works internally. My goal is to do this through visual understanding; if you are interested in the mathematics, there are many sources available on the web and in print. After that, I will present methods for determining the number of subgroups, or clusters, if that is not known beforehand.

Here is data with 2 features. I know that the data for this sample is originally from two subgroups. The first step in the ‘kmeans’ algorithm is to randomly assign each point to one of the two clusters. This is the random aspect of the kmeans algorithm. Cluster one is represented by empty green circles and cluster two is represented by empty blue triangles.

The next step of ‘kmeans’ is to calculate the centers of each of the two subgroups. The center of each subgroup is the average position of all the points in that subgroup. The center for each subgroup is shown as the solid green circle and the solid blue triangle for subgroups 1 and 2 respectively.

Next, each point in the data is assigned to the cluster of the nearest center. Here, you can see that all the points closest to the solid blue triangle center have been assigned to that cluster. The equivalent is true for the other subgroup. This completes one iteration of the ‘kmeans’ algorithm.

The ‘kmeans’ algorithm finishes when no points change assignment. In this case, many points change cluster assignment, so another iteration will be completed. Here, we see the kmeans algorithm after completion of 2 iterations. New cluster centers have been calculated and each observation has been assigned to the cluster of the nearest center. And here is the algorithm after completion of three iterations.
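The iterate-until-no-point-moves loop described above can be sketched in a few lines of base R. This is an illustrative toy implementation on simulated data, not the code from the course; the variable names are my own, and it assumes neither cluster ever becomes empty during the iterations.

```r
set.seed(1)
# Simulated data with 2 features, drawn from two well-separated subgroups
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2))
k <- 2

# Step 1: randomly assign each point to one of the k clusters
cluster <- sample(1:k, nrow(x), replace = TRUE)

repeat {
  # Step 2: each center is the average position of its cluster's points
  centers <- apply(x, 2, function(col) tapply(col, cluster, mean))

  # Step 3: reassign each point to the cluster of the nearest center
  # (distances from every point to every center)
  d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
  new_cluster <- apply(d, 1, which.min)

  # Stop when no points change assignment
  if (all(new_cluster == cluster)) break
  cluster <- new_cluster
}
```

After the loop, `cluster` holds the final assignment and `centers` the final cluster centers, mirroring the plots in the video.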
Again some points have changed cluster assignments, so another iteration of the algorithm will complete. And this is after completion of the fourth iteration. The algorithm is complete after the fifth iteration: no observations have changed assignment from the end of the fourth iteration to the end of this one, so the ‘kmeans’ algorithm stops. This final plot thus shows the cluster assignment for each observation and the cluster centers for each of the two clusters. There are other stopping criteria that you can specify for the ‘kmeans’ algorithm, such as stopping after some number of iterations or when the cluster centers move less than some distance.

Because kmeans has a random component, it is run multiple times and the best solution is selected from the multiple runs. The ‘kmeans’ algorithm needs a measurement of model quality to determine the ‘best’ outcome of multiple runs. ‘kmeans’ in R uses the total within-cluster sum of squares as that measurement. The ‘kmeans’ run with the minimum total within-cluster sum of squares is considered the best model.

Total within-cluster sum of squares is easy to calculate: for each cluster in the model and for each observation assigned to that cluster, calculate the squared distance from the observation to the cluster center. This is just the squared Euclidean distance from plane geometry class. Sum all of the squared distances and that is the total within-cluster sum of squares.

R does all of this model selection automatically. By specifying ‘nstart’ in kmeans, the algorithm will be run ‘nstart’ times and the run with the lowest total within-cluster sum of squares will be the resulting model. This helps the algorithm find a global minimum instead of a local minimum, but does not guarantee that outcome. In the hands-on exercises I will show you how to determine the total within-cluster sum of squares from the results of running kmeans.
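A minimal sketch of the `nstart` mechanism described above, on simulated data (the data set here is my own, not the one from the video):

```r
set.seed(42)
# Simulated data with 2 features from two subgroups
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# Run kmeans 20 times from different random starts;
# the run with the lowest total within-cluster sum of squares is kept
km <- kmeans(x, centers = 2, nstart = 20)

km$tot.withinss  # total within-cluster sum of squares of the best run
km$withinss      # per-cluster sums of squares; these add up to tot.withinss
```

Note that `kmeans()` also reports `km$cluster` (the assignment of each observation) and `km$centers` (the final cluster centers), which correspond to the quantities tracked in the visual walkthrough.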
Here is a visual example of running the ‘kmeans’ algorithm on the same data multiple times. In this case it is known that there are three clusters within the data. The graph on the top right has the lowest total within-cluster sum of squares.

Another item of note: cluster membership is color-coded in these plots. Notice that even between runs that find approximately the same solution, the cluster labels are assigned differently. This is not a big deal, just a property of the ‘kmeans’ algorithm that you should keep in mind. For repeatability, use R's set.seed() function before running ‘kmeans’ to guarantee reproducibility.

If you don't know the number of subgroups within the data beforehand, there is a way to heuristically determine the number of clusters. You could use trial and error, but a better approach is to run ‘kmeans’ with 1 through some number of clusters, recording the total within-cluster sum of squares for each number of clusters.
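The heuristic described above is often called the elbow method; a sketch on simulated data (the data and the range k = 1..10 are my own choices for illustration) looks like this:

```r
set.seed(123)
# Simulated data with 2 features from three subgroups
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2),
           matrix(rnorm(60, mean = 8), ncol = 2))

# Record the total within-cluster sum of squares for k = 1 through 10
wss <- sapply(1:10, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Plot wss against k and look for the "elbow" where the decrease levels off
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The total within-cluster sum of squares always decreases as k grows, so the point to look for is the bend where adding another cluster stops buying much improvement; for this simulated data that bend should appear near k = 3.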
