Homework 10

Due Monday November 26 at 11:59pm

Homework policies and submission instructions

The homework should be done in Python. Starter code for some problems has been posted on Piazza.

Make sure to upload all Python files related to your homework in addition to your PDF report. Please do not compress the PDF report along with your code files. You may upload the code as a compressed archive or not, as you prefer, as long as the PDF is uploaded separately from the compressed file. We will run your code using either Python 2 or 3, so if you choose not to use a Jupyter notebook, expect us to run your code as follows:

python your-python-code-file-name.py

When grading, we will modify your code ourselves to load the data correctly, so you do not need to standardize where you place your data files while completing the assignment.

Good luck, and please get started early!

Problems

  1. (20 points) Textbook problem 11.1(a). Do not do parts (b) or (c).
  2. (20 points) Textbook problem 11.4
  3. (10 points) Textbook problem 11.7

This homework deviates from the textbook in several ways. Most notably, we will be using scikit-learn rather than the R packages; the remaining changes are noted per problem below.

--------------11.1 NOTES--------------

For 11.1(a) there are multiple important changes. We would like you to use the following data:

pima-indians-diabetes.data

The description for the data is provided here:

pima-indians-diabetes.names

For 11.1(a) we would like you to use the first 80% of the data file for your training set, and the last 20% of the data file for your evaluation set. These two sets should be obtained in the order listed in the data file itself.

For 11.1(a) we would like you to report the class confusion matrix that you obtain when evaluating your Naive Bayes classifier on your evaluation set. We would also like you to report the accuracy and error rate of your classifier.
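The 11.1(a) workflow above (ordered 80/20 split, Naive Bayes fit, confusion matrix, accuracy, and error rate) might be sketched as follows. The synthetic stand-in data is an assumption for illustration; in your solution you would load the real file instead, e.g. with `np.loadtxt`, and you may prefer a different Naive Bayes variant than the Gaussian one used here.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

# In practice, load the real file:
#   data = np.loadtxt("pima-indians-diabetes.data", delimiter=",")
# Here we use a synthetic stand-in of the same shape (768 rows:
# 8 feature columns plus a binary class label in the last column).
rng = np.random.RandomState(0)
data = np.hstack([rng.rand(768, 8), rng.randint(0, 2, size=(768, 1))])

X, y = data[:, :-1], data[:, -1]

# First 80% of the file (in order) for training, last 20% for evaluation.
split = int(0.8 * len(data))
X_train, y_train = X[:split], y[:split]
X_eval, y_eval = X[split:], y[split:]

clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_eval)

cm = confusion_matrix(y_eval, y_pred)
acc = accuracy_score(y_eval, y_pred)
print("Confusion matrix:\n", cm)
print("Accuracy: %.3f  Error rate: %.3f" % (acc, 1 - acc))
```

Note that the split is taken in file order, as the instructions require, rather than shuffled.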

--------------11.4 NOTES--------------

For the SVM dataset, use the data file named "wdbc.data". Use the file "wdbc.names" to guide your data cleaning and preprocessing. The choice of which columns to drop will become apparent if you read through "wdbc.names" carefully.
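A minimal sketch of the cleaning step and an SVM fit is shown below. The column layout follows "wdbc.names" (column 0 is a patient ID, column 1 is the M/B diagnosis, the remaining 30 columns are real-valued features); the synthetic stand-in data and the choice of a linear-kernel `SVC` are assumptions for illustration, not the required configuration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# In practice, read the real file as strings so the M/B label survives:
#   raw = np.genfromtxt("wdbc.data", delimiter=",", dtype=str)
#   ids, labels, features = raw[:, 0], raw[:, 1], raw[:, 2:].astype(float)
# Synthetic stand-in with the same 569-row layout:
rng = np.random.RandomState(0)
labels = rng.choice(["M", "B"], size=569)
features = rng.rand(569, 30)

X = features                      # the ID column carries no signal and is dropped
y = (labels == "M").astype(int)   # encode malignant=1, benign=0

clf = SVC(kernel="linear")
clf.fit(X, y)
train_acc = accuracy_score(y, clf.predict(X))
print("Training accuracy: %.3f" % train_acc)
```

Feature scaling (e.g. `sklearn.preprocessing.StandardScaler`) is often worth adding before an SVM, since the wdbc features are on very different scales.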

--------------11.7 NOTES--------------

For the random forest question, please use an 80-20 training/evaluation set split of your data. This means that 80% of your data will be used for fitting/training your random forest classifier, and 20% of your data will be used to determine the accuracy and the class confusion matrix of your classifier. The ordering of this split is left up to you.

Since we are not using R's random forest classifier, please use scikit-learn's random forest classifier which can be imported using the following:

from sklearn.ensemble import RandomForestClassifier

The API guide is here: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
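The split-and-evaluate procedure described above might be sketched as follows. The synthetic stand-in data is an assumption; substitute your real features and labels once loaded. A random 80-20 split is used here since the ordering of the split is left up to you.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

# Synthetic stand-in for your dataset; replace with the real
# features X and labels y once you have loaded them.
rng = np.random.RandomState(0)
X = rng.rand(500, 10)
y = rng.randint(0, 2, size=500)

# 80% for fitting, 20% for evaluation; the ordering is up to you,
# so a shuffled split is fine.
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_eval)

cm = confusion_matrix(y_eval, y_pred)
acc = accuracy_score(y_eval, y_pred)
print("Confusion matrix:\n", cm)
print("Accuracy: %.3f" % acc)
```

Passing `n_estimators` explicitly is a good habit, since its default has differed across scikit-learn versions.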