Chapter 3 Data transformation

The dataset we downloaded is a .csv file, so we read it directly into R.

loan <- read.csv(file = "loan.csv", header = TRUE, sep = ",")

Further data transformation was done with the pandas Python library. We selected the following columns, which we need in our analysis:

zip_code : The first 3 numbers of the zip code provided by the borrower in the loan application.

annual_inc : The self-reported annual income provided by the borrower during registration.

addr_state : The state provided by the borrower in the loan application.

issue_d : The month in which the loan was funded.

grade : LC-assigned loan grade.

purpose : A category provided by the borrower for the loan request.

loan_status : Current status of the loan.

home_ownership : The home ownership status provided by the borrower during registration or obtained from the credit report. The values in our data are RENT, OWN, MORTGAGE, and OTHER.

emp_length : Employment length in years. Possible values range from 0 to 10, where 0 means less than one year and 10 means ten or more years.
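The column selection above could be sketched in pandas roughly as follows. This is a minimal illustration, not the exact code used for the analysis: the helper name select_columns and the two sample rows are made up for the example, and the in-memory CSV stands in for loan.csv.

```python
import io

import pandas as pd

# The nine columns listed above.
COLUMNS = [
    "zip_code", "annual_inc", "addr_state", "issue_d", "grade",
    "purpose", "loan_status", "home_ownership", "emp_length",
]


def select_columns(csv_source):
    """Read a CSV and keep only the columns needed for the analysis.

    Hypothetical helper: csv_source can be a path such as "loan.csv"
    or any file-like object.
    """
    return pd.read_csv(csv_source, usecols=COLUMNS)


# Made-up sample rows standing in for loan.csv (assumption, not real data).
sample_csv = io.StringIO(
    "loan_amnt,zip_code,annual_inc,addr_state,issue_d,grade,purpose,"
    "loan_status,home_ownership,emp_length\n"
    "5000,940xx,60000,CA,Dec-2015,B,credit_card,Fully Paid,RENT,10\n"
    "12000,100xx,85000,NY,Jan-2016,A,debt_consolidation,Current,MORTGAGE,0\n"
)

loans = select_columns(sample_csv)
print(loans.columns.tolist())
```

Passing usecols to read_csv avoids loading the unused columns at all, which can matter for a large file; selecting the columns after a full read would give the same result.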

One part of the data cleaning is to delete incorrect records, namely those whose zip code does not correspond to the state given in addr_state.
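One way this check could look in pandas is sketched below. Several things here are assumptions for illustration: that zip_code is stored as a string whose first three characters are the digits of the prefix (for example "940xx"), the names STATE_PREFIX_RANGES, zip_matches_state, and drop_mismatched_zip, and the sample rows. The lookup table is deliberately partial, covering only two states; a real cleaning step would need a complete prefix-to-state table.

```python
import pandas as pd

# Illustrative, partial lookup (assumption): inclusive ranges of 3-digit
# ZIP prefixes per state. A real run needs a table covering every state.
STATE_PREFIX_RANGES = {
    "CA": [(900, 961)],
    "NY": [(100, 149)],
}


def zip_matches_state(zip_code, state):
    """Return True if the 3-digit ZIP prefix is plausible for the state."""
    prefix = str(zip_code)[:3]
    if not prefix.isdigit():
        return False  # malformed prefix: treat as incorrect
    if state not in STATE_PREFIX_RANGES:
        return True  # state missing from this partial table: cannot check, keep
    number = int(prefix)
    return any(low <= number <= high for low, high in STATE_PREFIX_RANGES[state])


def drop_mismatched_zip(df):
    """Drop rows whose zip_code prefix does not match addr_state."""
    keep = pd.Series(
        [zip_matches_state(z, s) for z, s in zip(df["zip_code"], df["addr_state"])],
        index=df.index,
    )
    return df[keep].reset_index(drop=True)


# Made-up sample: the third row pairs a California prefix with NY.
sample = pd.DataFrame({
    "zip_code": ["940xx", "100xx", "940xx"],
    "addr_state": ["CA", "NY", "NY"],
})

cleaned = drop_mismatched_zip(sample)
print(cleaned)
```

In this sketch, rows for states absent from the lookup are kept rather than dropped, so that an incomplete table does not silently discard valid records.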