Diagnosing Cancer with AI

Neural networks are mathematical models loosely based on the biological neuron. They can approximate functional mappings from sample data (to put it mildly).
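To make the idea of a mathematical neuron concrete, here is a minimal sketch in Ruby. It assumes a logistic (sigmoid) activation and made-up weights; the method names are illustrative and are not taken from the code behind this demo.

```ruby
# Logistic (sigmoid) activation: squashes any real number into (0, 1).
def sigmoid(x)
  1.0 / (1.0 + Math.exp(-x))
end

# One artificial neuron: a weighted sum of the inputs plus a bias,
# passed through the activation function.
def neuron_output(inputs, weights, bias)
  weighted_sum = inputs.zip(weights).sum { |input, weight| input * weight }
  sigmoid(weighted_sum + bias)
end

# Hypothetical inputs and weights, chosen only for illustration.
puts neuron_output([0.5, -1.0], [0.8, 0.2], 0.1)
```

A full network is many of these neurons wired together in layers, with the weights and biases found by a training algorithm rather than picked by hand.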

Ruby is an object-oriented, functionally inspired, beautiful programming language. I used it to build the models because it's so simple to read, not because it's a practical choice for production.

Step 1: Data data data

The first thing you need for machine learning is a data set. The data set I used is available from the popular UCI Machine Learning Repository. It consists of measurements taken from digital images of breast tissue samples from various people. A digital image is taken of an FNA (fine needle aspirate), and various measurements are taken of the cells. Once you have the data, you split it into a training set (~60%) and a generalization set (~40%).
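The 60/40 split can be sketched in a few lines of Ruby. This is an illustration under assumptions: the method name `split_data`, the fixed seed, and the idea of shuffling before cutting are mine, not necessarily how this demo does it.

```ruby
# Shuffle the rows reproducibly, then cut them at the training fraction.
# Returns [training_set, generalization_set].
def split_data(rows, training_fraction: 0.6, seed: 42)
  shuffled = rows.shuffle(random: Random.new(seed))
  cut = (rows.size * training_fraction).round
  [shuffled.first(cut), shuffled.drop(cut)]
end

rows = (1..10).to_a # stand-in for the parsed data rows
training, generalization = split_data(rows)
puts "training: #{training.size}, generalization: #{generalization.size}"
```

Shuffling first matters when the file is ordered in some way, because a plain cut could otherwise leave one class over-represented in one of the two sets.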

The data consists of real medical readings taken from a digital image of a small tissue sample (a fine needle aspirate, "FNA"). Each number in the comma-separated lists below corresponds to a specific attribute of the tissue, measured from that image.

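The attributes consist of 10 measurements, each summarized by 3 statistics over the cell nuclei in the image (the mean, the standard error, and the "worst", i.e. the mean of the three largest values), resulting in 30 attributes.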

  • Radius (mean of distances from center to points on the perimeter)
  • Texture (standard deviation of gray-scale values)
  • Perimeter
  • Area
  • Smoothness (local variation in radius lengths)
  • Compactness (perimeter^2 / area - 1.0)
  • Concavity (severity of concave portions of the contour)
  • Concave points (number of concave portions of the contour)
  • Symmetry
  • Fractal dimension ("coastline approximation" - 1)

Here are some examples of measurements from the data set:

Malignant measurements

  • 17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
  • 13.17,18.66,85.98,534.6,0.1158,0.1231,0.1226,0.0734,0.2128,0.06777,0.2871,0.8937,1.897,24.25,0.006532,0.02336,0.02905,0.01215,0.01743,0.003643,15.67,27.95,102.8,759.4,0.1786,0.4166,0.5006,0.2088,0.39,0.1179
  • 18.65,17.6,123.7,1076,0.1099,0.1686,0.1974,0.1009,0.1907,0.06049,0.6289,0.6633,4.293,71.56,0.006294,0.03994,0.05554,0.01695,0.02428,0.003535,22.82,21.32,150.6,1567,0.1679,0.509,0.7345,0.2378,0.3799,0.09185

Benign measurements

  • 7.76,24.54,47.92,181,0.05263,0.04362,0,0,0.1587,0.05884,0.3857,1.428,2.548,19.15,0.007189,0.00466,0,0,0.02676,0.002783,9.456,30.37,59.16,268.6,0.08996,0.06444,0,0,0.2871,0.07039
  • 12.05,14.63,78.04,449.3,0.1031,0.09092,0.06592,0.02749,0.1675,0.06043,0.2636,0.7294,1.848,19.87,0.005488,0.01427,0.02322,0.00566,0.01428,0.002422,13.76,20.7,89.88,582.6,0.1494,0.2156,0.305,0.06548,0.2747,0.08301
  • 13.03,18.42,82.61,523.8,0.08983,0.03766,0.02562,0.02923,0.1467,0.05863,0.1839,2.342,1.17,14.16,0.004352,0.004899,0.01343,0.01164,0.02671,0.001777,13.3,22.81,84.46,545.9,0.09701,0.04619,0.04833,0.05013,0.1987,0.06169
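As a rough sketch, one of these lines can be parsed into 30 numbers and regrouped into three consecutive groups of ten, one value per measurement in each group. The names `MEASUREMENTS`, `parse_pattern`, and `label_groups` are hypothetical helpers, and the sample line is the first malignant example above.

```ruby
# The ten measurements, in the order listed above.
MEASUREMENTS = %i[
  radius texture perimeter area smoothness
  compactness concavity concave_points symmetry fractal_dimension
].freeze

# Turn one comma-separated line into an array of 30 floats.
def parse_pattern(line)
  values = line.strip.split(",").map { |field| Float(field) }
  raise ArgumentError, "expected 30 values, got #{values.size}" unless values.size == 30
  values
end

# Slice the 30 values into groups of ten and name each value.
def label_groups(values)
  values.each_slice(MEASUREMENTS.size).map { |group| MEASUREMENTS.zip(group).to_h }
end

line = "17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871," \
       "1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193," \
       "25.38,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189"

groups = label_groups(parse_pattern(line))
puts groups.first[:radius]
```

Using `Float(field)` rather than `to_f` makes a malformed field raise an error instead of silently becoming 0.0.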

Step 2: Training

The training set is used along with a training algorithm to correct the neural network's classifications during training, teaching it to classify the mass as either malignant or benign. I'm training the neural network with a particle swarm optimizer, a clever global search algorithm.
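For readers curious what a particle swarm optimizer looks like, here is a minimal sketch, not the implementation used in this demo. It assumes a single sigmoid neuron instead of a full network, a toy four-row data set (logical OR) instead of the cancer data, and commonly used coefficient values (inertia 0.729, cognitive and social 1.49445). All method names are illustrative.

```ruby
def sigmoid(x)
  1.0 / (1.0 + Math.exp(-x))
end

# Mean squared error of one sigmoid neuron; weights = [w1, w2, bias].
def mean_squared_error(weights, patterns)
  total = patterns.sum do |inputs, target|
    activation = inputs.zip(weights).sum { |x, w| x * w } + weights.last
    (sigmoid(activation) - target)**2
  end
  total / patterns.size
end

# Minimize the fitness block over vectors of the given dimension.
# Returns [best_position, best_score].
def particle_swarm(dimensions, particles: 20, iterations: 200, seed: 1, &fitness)
  rng = Random.new(seed)
  inertia, cognitive, social = 0.729, 1.49445, 1.49445

  positions  = Array.new(particles) { Array.new(dimensions) { rng.rand(-1.0..1.0) } }
  velocities = Array.new(particles) { Array.new(dimensions, 0.0) }
  best_positions = positions.map(&:dup)            # each particle's own best
  best_scores    = positions.map { |position| fitness.call(position) }
  leader         = best_scores.each_index.min_by { |i| best_scores[i] }
  global_best    = best_positions[leader].dup      # best seen by the whole swarm
  global_score   = best_scores[leader]

  iterations.times do
    positions.each_with_index do |position, p|
      dimensions.times do |d|
        # Keep some momentum, then pull toward the personal and global bests.
        velocities[p][d] = inertia * velocities[p][d] +
                           cognitive * rng.rand * (best_positions[p][d] - position[d]) +
                           social * rng.rand * (global_best[d] - position[d])
        position[d] += velocities[p][d]
      end
      score = fitness.call(position)
      next unless score < best_scores[p]

      best_scores[p] = score
      best_positions[p] = position.dup
      if score < global_score
        global_score = score
        global_best = position.dup
      end
    end
  end
  [global_best, global_score]
end

# Toy data (logical OR), used only to exercise the optimizer.
toy_patterns = [[[0.0, 0.0], 0.0], [[0.0, 1.0], 1.0], [[1.0, 0.0], 1.0], [[1.0, 1.0], 1.0]]
weights, error = particle_swarm(3) { |candidate| mean_squared_error(candidate, toy_patterns) }
puts "best error: #{error}"
```

The optimizer never needs gradients: it only evaluates the error of candidate weight vectors, which is why it can be dropped in as a trainer for any network whose weights can be flattened into one array.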

Let us train the network now...

Step 3: Evaluation

Once the training is complete, it will show the trained network's average classification accuracy. This may vary greatly because I have not tuned the algorithm's parameters (train again if the accuracy is not sufficient).
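Classification accuracy is simply the fraction of patterns the network labels correctly. A small sketch, assuming the network emits a single value between 0 and 1 and that 0.5 is the decision threshold (both are assumptions, as are the method names and the sample outputs):

```ruby
# Map a network output in (0, 1) to a label; the threshold is an assumption.
def label_for(output, threshold: 0.5)
  output >= threshold ? :malignant : :benign
end

# Fraction of predictions that match the expected labels.
def classification_accuracy(predicted, expected)
  raise ArgumentError, "length mismatch" unless predicted.size == expected.size

  correct = predicted.zip(expected).count { |guess, truth| guess == truth }
  correct.to_f / expected.size
end

outputs   = [0.91, 0.12, 0.67, 0.08] # hypothetical network outputs
expected  = %i[malignant benign benign benign]
predicted = outputs.map { |output| label_for(output) }
puts classification_accuracy(predicted, expected) # 3 of 4 correct => 0.75
```

Measuring this on the generalization set rather than the training set is what tells you whether the network learned the pattern or merely memorized its training data.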

Let us test the trained network by giving it two data patterns from the generalization set (patterns the trained network has never encountered) and see if it classifies them as expected.

We expect the following unseen data to classify as Malignant.


We expect the following unseen data to classify as Benign.


You can submit your readings here to be classified with the pretrained neural network.

(try copying some of the example measurements from step 1 here)

Step 4: Share the love

To tell the truth, this is a very easy problem for AI to solve. In fact, it can be solved with statistical methods alone! That said, cancer is a destructive disease that has affected many people I hold dear. If we can solve a problem like cancer diagnosis, which affects so many people, what other problems can be solved with machine learning? I hope you were inspired by this demonstration, and I hope you will share it and spread the love. Who knows what we can do with machine learning if we have the right data!