3 @brief Cluster analysis algorithm: Genetic clustering algorithm (GA).
4 @details Implementation based on papers @cite article::ga::1, @cite article::ga::2.
6 @authors Aleksey Kukushkin, Andrei Novikov (pyclustering@yandex.ru)
8 @copyright BSD-3-Clause
15 import matplotlib.pyplot
as plt
16 import matplotlib.animation
as animation
19 from pyclustering.cluster.ga_maths
import ga_math
25 @brief Genetic algorithm observer that is used to collect information about clustering process on each iteration.
29 def __init__(self, need_global_best=False, need_population_best=False, need_mean_ff=False):
31 @brief Constructs genetic algorithm observer to collect specific information.
33 @param[in] need_global_best (bool): If 'True' then the best chromosomes and its fitness function value (global optimum) for each iteration are stored.
34 @param[in] need_population_best (bool): If 'True' then current (on each iteration) best chromosomes and its fitness function value (local optimum) are stored.
35 @param[in] need_mean_ff (bool): If 'True' then average value of fitness function on each iteration is stored.
56 @brief Returns amount of iterations that genetic algorithm was observed.
63 return max(global_length, local_length, average_length)
68 @brief Stores the best chromosome and its fitness function's value.
70 @param[in] best_chromosome (list): The best chromosome that were observed.
71 @param[in] best_fitness_function (float): Fitness function value of the best chromosome.
84 @brief Stores the best chromosome for current specific iteration and its fitness function's value.
86 @param[in] best_chromosome (list): The best chromosome on specific iteration.
87 @param[in] best_fitness_function (float): Fitness function value of the chromosome.
100 @brief Stores average value of fitness function among chromosomes on specific iteration.
102 @param[in] fitness_functions (float): Average value of fitness functions among chromosomes.
114 @return (dict) Returns dictionary with keys 'chromosome' and 'fitness_function' where evolution of the best chromosome
115 and its fitness function's value (evolution of global optimum) are stored in lists.
123 @brief (dict) Returns dictionary with keys 'chromosome' and 'fitness_function' where evolution of the current best chromosome
124 and its fitness function's value (evolution of local optimum) are stored in lists.
132 @brief (list) Returns fitness function's values on each iteration.
141 @brief Genetic algorithm visualizer is used to show clustering results that are specific for
142 this particular algorithm: clusters, evolution of global and local optimum.
143 @details The visualizer requires 'ga_observer' that collects evolution of clustering process in
144 genetic algorithm. The observer is created by user and passed to genetic algorithm. There
145 is usage example of the visualizer using the observer:
147 from pyclustering.cluster.ga import genetic_algorithm, ga_observer, ga_visualizer
148 from pyclustering.utils import read_sample
149 from pyclustering.samples.definitions import SIMPLE_SAMPLES
152 # Read data for clustering
153 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE1)
155 # Create instance of observer that will collect all information:
156 observer_instance = ga_observer(True, True, True)
158 # Create genetic algorithm where observer will collect information:
159 ga_instance = genetic_algorithm(data=sample,
163 count_mutation_gens=1,
164 observer=observer_instance)
167 ga_instance.process()
170 clusters = ga_instance.get_clusters()
172 # Print cluster to console
173 print("Amount of clusters: '%d'. Clusters: '%s'" % (len(clusters), clusters))
175 # Show cluster using observer:
176 ga_visualizer.show_clusters(sample, observer_instance)
179 @see cluster_visualizer
184 def show_evolution(observer, start_iteration=0, stop_iteration=None, ax=None, display=True):
186 @brief Displays evolution of fitness function for the best chromosome, for the current best chromosome and
187 average value among all chromosomes.
189 @param[in] observer (ga_observer): Genetic algorithm observer that was used for collecting evolution in the algorithm and
190 where whole required information for visualization is stored.
191 @param[in] start_iteration (uint): Iteration from that evolution should be shown.
192 @param[in] stop_iteration (uint): Iteration after that evolution shouldn't be shown.
193 @param[in] ax (Axes): Canvas where evolution should be displayed.
194 @param[in] display (bool): If 'True' then visualization of the evolution will be shown by the function.
195 This argument should be 'False' if you want to add something else to the canvas and display it later.
197 @return (Axis) Canvas where evolution was shown.
202 _, ax = plt.subplots(1)
203 ax.set_title(
"Evolution")
205 if stop_iteration
is None:
206 stop_iteration = len(observer)
208 line_best, = ax.plot(observer.get_global_best()[
'fitness_function'][start_iteration:stop_iteration],
'r')
209 line_current, = ax.plot(observer.get_population_best()[
'fitness_function'][start_iteration:stop_iteration],
'k')
210 line_mean, = ax.plot(observer.get_mean_fitness_function()[start_iteration:stop_iteration],
'c')
212 if start_iteration < (stop_iteration - 1):
213 ax.set_xlim([start_iteration, (stop_iteration - 1)])
215 ax.set_xlabel(
"Iteration")
216 ax.set_ylabel(
"Fitness function")
217 ax.legend([line_best, line_current, line_mean], [
"The best pop.",
"Cur. best pop.",
"Average"], prop={
'size': 10})
229 @brief Shows allocated clusters by the genetic algorithm.
231 @param[in] data (list): Input data that was used for clustering process by the algorithm.
232 @param[in] observer (ga_observer): Observer that was used for collection information about clustering process.
233 @param[in] marker (char): Type of marker that should be used for object (point) representation.
234 @param[in] markersize (uint): Size of the marker that is used for object (point) representation.
236 @note If you have clusters instead of observer then 'cluster_visualizer' can be used for visualization purposes.
238 @see cluster_visualizer
242 figure = plt.figure()
243 ax1 = figure.add_subplot(121)
245 clusters = ga_math.get_clusters_representation(observer.get_global_best()[
'chromosome'][-1])
248 visualizer.append_clusters(clusters, data, 0, marker, markersize)
249 visualizer.show(figure, display=
False)
251 ga_visualizer.show_evolution(observer, 0,
None, ax1,
True)
257 @brief Animate clustering process of genetic clustering algorithm.
258 @details This method can be also used for rendering movie of clustering process and 'ffmpeg' is required for that purpuse.
260 @param[in] data (list): Input data that was used for clustering process by the algorithm.
261 @param[in] observer (ga_observer): Observer that was used for collection information about clustering process.
262 Be sure that whole information was collected by the observer.
263 @param[in] animation_velocity (uint): Interval between frames in milliseconds (for run-time animation only).
264 @param[in] movie_fps (uint): Defines frames per second (for rendering movie only).
265 @param[in] save_movie (string): If it is specified then animation will be stored to file that is specified in this parameter.
269 figure = plt.figure()
272 return frame_generation(0)
274 def frame_generation(index_iteration):
277 figure.suptitle(
"Clustering genetic algorithm (iteration: " + str(index_iteration) +
")", fontsize=18, fontweight=
'bold')
279 visualizer =
cluster_visualizer(4, 2, [
"The best pop. on step #" + str(index_iteration),
"The best population"])
281 local_minimum_clusters = ga_math.get_clusters_representation(observer.get_population_best()[
'chromosome'][index_iteration])
282 visualizer.append_clusters(local_minimum_clusters, data, 0)
284 global_minimum_clusters = ga_math.get_clusters_representation(observer.get_global_best()[
'chromosome'][index_iteration])
285 visualizer.append_clusters(global_minimum_clusters, data, 1)
287 ax1 = plt.subplot2grid((2, 2), (1, 0), colspan=2)
288 ga_visualizer.show_evolution(observer, 0, index_iteration + 1, ax1,
False)
290 visualizer.show(figure, shift=0, display=
False)
291 figure.subplots_adjust(top=0.85)
293 return [figure.gca()]
295 iterations = len(observer)
296 cluster_animation = animation.FuncAnimation(figure, frame_generation, iterations, interval=animation_velocity, init_func=init_frame, repeat_delay=5000)
298 if save_movie
is not None:
299 cluster_animation.save(save_movie, writer=
'ffmpeg', fps=movie_fps, bitrate=1500)
306 @brief Class represents Genetic clustering algorithm.
307 @details The searching capability of genetic algorithms is exploited in order to search for appropriate
310 Example of clustering using genetic algorithm:
312 from pyclustering.cluster.ga import genetic_algorithm, ga_observer
313 from pyclustering.utils import read_sample
314 from pyclustering.samples.definitions import SIMPLE_SAMPLES
317 # Read input data for clustering
318 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE4)
320 # Create instance of observer that will collect all information:
321 observer_instance = ga_observer(True, True, True)
323 # Create genetic algorithm for clustering
324 ga_instance = genetic_algorithm(data=sample,
326 chromosome_count=100,
327 population_count=200,
328 count_mutation_gens=1)
331 ga_instance.process()
334 clusters = ga_instance.get_clusters()
336 # Print cluster to console
337 print("Amount of clusters: '%d'. Clusters: '%s'" % (len(clusters), clusters))
340 There is an example of clustering results (fitness function evolution and allocated clusters) that were
341 visualized by 'ga_visualizer':
343 @image html ga_clustering_sample_simple_04.png
350 def __init__(self, data, count_clusters, chromosome_count, population_count, **kwargs):
352 @brief Initialize genetic clustering algorithm.
354 @param[in] data (numpy.array|list): Input data for clustering that is represented by two dimensional array
355 where each row is a point, for example, [[0.0, 2.1], [0.1, 2.0], [-0.2, 2.4]].
356 @param[in] count_clusters (uint): The amount of clusters that should be allocated in the data.
357 @param[in] chromosome_count (uint): The amount of chromosomes in each population.
358 @param[in] population_count (uint): The amount of populations that essentially defines the amount of iterations.
359 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: `count_mutation_gens`,
360 `coeff_mutation_count`, `select_coeff`, `crossover_rate`, `observer`, `random_state`).
362 <b>Keyword Args:</b><br>
363 - count_mutation_gens (uint): Amount of genes in chromosome that is mutated on each step.
364 - coeff_mutation_count (float): Percent of chromosomes for mutation, distributed in range (0, 1] and
365 thus amount of chromosomes is defined as follows: `chromosome_count` * `coeff_mutation_count`
366 - select_coeff (float): Exponential coefficient for selection procedure that is used as follows:
367 `math.exp(1 + fitness(chromosome) * select_coeff)`.
368 - crossover_rate (float): Crossover rate.
369 - observer (ga_observer): Observer that is used for collecting information of about clustering process on each step.
370 - random_state (int): Seed for random state (by default is `None`, current system time is used).
375 numpy.random.seed(kwargs.get(
'random_state',
None))
378 self.
_data = numpy.array(data)
403 'best_fitness_function': 0.0}
413 @brief Perform clustering procedure in line with rule of genetic clustering algorithm.
423 best_chromosome, best_ff, first_fitness_functions \
428 self.
_observer.collect_global_best(best_chromosome, best_ff)
429 self.
_observer.collect_population_best(best_chromosome, best_ff)
430 self.
_observer.collect_mean(first_fitness_functions)
445 new_best_chromosome, new_best_ff, fitness_functions \
449 if new_best_ff < best_ff:
450 best_ff = new_best_ff
451 best_chromosome = new_best_chromosome
455 self.
_observer.collect_global_best(best_chromosome, best_ff)
456 self.
_observer.collect_population_best(new_best_chromosome, new_best_ff)
457 self.
_observer.collect_mean(fitness_functions)
463 return best_chromosome, best_ff
468 @brief Returns genetic algorithm observer.
476 @brief Returns list of allocated clusters, each cluster contains indexes of objects from the data.
478 @return (list) List of allocated clusters.
488 def _select(chromosomes, data, count_clusters, select_coeff):
490 @brief Performs selection procedure where new chromosomes are calculated.
492 @param[in] chromosomes (numpy.array): Chromosomes
497 centres = ga_math.get_centres(chromosomes, data, count_clusters)
500 fitness = genetic_algorithm._calc_fitness_function(centres, data, chromosomes)
501 fitness = numpy.exp(1.0 + fitness * select_coeff)
504 probabilities = ga_math.calc_probability_vector(fitness)
507 new_chromosomes = numpy.zeros(chromosomes.shape, dtype=numpy.int)
510 for _idx
in range(len(chromosomes)):
511 new_chromosomes[_idx] = chromosomes[ga_math.get_uniform(probabilities)]
513 return new_chromosomes
517 def _crossover(chromosomes):
519 @brief Crossover procedure.
524 pairs_to_crossover = numpy.array(range(len(chromosomes)))
527 numpy.random.shuffle(pairs_to_crossover)
530 offset_in_pair = int(len(pairs_to_crossover) / 2)
533 for _idx
in range(offset_in_pair):
536 crossover_mask = genetic_algorithm._get_crossover_mask(len(chromosomes[_idx]))
539 genetic_algorithm._crossover_a_pair(chromosomes[pairs_to_crossover[_idx]],
540 chromosomes[pairs_to_crossover[_idx + offset_in_pair]],
545 def _mutation(chromosomes, count_clusters, count_gen_for_mutation, coeff_mutation_count):
547 @brief Mutation procedure.
552 count_gens = len(chromosomes[0])
555 random_idx_chromosomes = numpy.array(range(len(chromosomes)))
556 numpy.random.shuffle(random_idx_chromosomes)
559 for _idx_chromosome
in range(int(len(random_idx_chromosomes) * coeff_mutation_count)):
562 for _
in range(count_gen_for_mutation):
565 gen_num = numpy.random.randint(count_gens)
568 chromosomes[random_idx_chromosomes[_idx_chromosome]][gen_num] = numpy.random.randint(count_clusters)
572 def _crossover_a_pair(chromosome_1, chromosome_2, mask):
574 @brief Crossovers a pair of chromosomes.
576 @param[in] chromosome_1 (numpy.array): The first chromosome for crossover.
577 @param[in] chromosome_2 (numpy.array): The second chromosome for crossover.
578 @param[in] mask (numpy.array): Crossover mask that defines which genes should be swapped.
582 for _idx
in range(len(chromosome_1)):
586 chromosome_1[_idx], chromosome_2[_idx] = chromosome_2[_idx], chromosome_1[_idx]
590 def _get_crossover_mask(mask_length):
592 @brief Crossover mask to crossover a pair of chromosomes.
594 @param[in] mask_length (uint): Length of the mask.
599 mask = numpy.zeros(mask_length)
602 mask[:int(int(mask_length) / 2)] = 1
605 numpy.random.shuffle(mask)
611 def _init_population(count_clusters, count_data, chromosome_count):
613 @brief Returns first population as a uniform random choice.
615 @param[in] count_clusters (uint): Amount of clusters that should be allocated.
616 @param[in] count_data (uint): Data size that is used for clustering process.
617 @param[in] chromosome_count (uint):Amount of chromosome that is used for clustering.
621 population = numpy.random.randint(count_clusters, size=(chromosome_count, count_data))
627 def _get_best_chromosome(chromosomes, data, count_clusters):
629 @brief Returns the current best chromosome.
631 @param[in] chromosomes (list): Chromosomes that are used for searching.
632 @param[in] data (list): Input data that is used for clustering process.
633 @param[in] count_clusters (uint): Amount of clusters that should be allocated.
635 @return (list, float, list) The best chromosome, its fitness function value and fitness function values for
641 centres = ga_math.get_centres(chromosomes, data, count_clusters)
644 fitness_functions = genetic_algorithm._calc_fitness_function(centres, data, chromosomes)
647 best_chromosome_idx = fitness_functions.argmin()
650 return chromosomes[best_chromosome_idx], fitness_functions[best_chromosome_idx], fitness_functions
654 def _calc_fitness_function(centres, data, chromosomes):
656 @brief Calculate fitness function values for chromosomes.
658 @param[in] centres (list): Cluster centers.
659 @param[in] data (list): Input data that is used for clustering process.
660 @param[in] chromosomes (list): Chromosomes whose fitness function's values are calculated.
662 @return (list) Fitness function value for each chromosome correspondingly.
667 count_chromosome = len(chromosomes)
670 fitness_function = numpy.zeros(count_chromosome)
673 for _idx_chromosome
in range(count_chromosome):
676 centres_data = numpy.zeros(data.shape)
679 for _idx
in range(len(data)):
680 centres_data[_idx] = centres[_idx_chromosome][chromosomes[_idx_chromosome][_idx]]
683 fitness_function[_idx_chromosome] += numpy.sum(abs(data - centres_data))
685 return fitness_function
688 def _verify_arguments(self):
690 @brief Verify input parameters for the algorithm and throw exception in case of incorrectness.
693 if len(self.
_data) == 0:
694 raise ValueError(
"Input data is empty (size: '%d')." % len(self.
_data))
697 raise ValueError(
"Amount of cluster (current value: '%d') for allocation should be greater than 0." %