Titanic with r

Kaggle + Titanic analysis
with Data Science
From installing R to random forest
NEXT, 2016 semester 3: Data visualization
김명찬

Titanic example
• http://trevorstephens.com/kaggle-titanic-tutorial/getting-started-with-r/
• Explains everything step by step in detail, from basic R usage to random forests.

Today's agenda
1. Kaggle submission example
– Everyone dies
2. Manually fitted model
– Grinding it out by hand
3. Decision tree
4. Feature Engineering
5. Random Forest

Kaggle
• An online platform for solving machine learning problems
• Forums
• Examples
• And plenty of other useful information.
– Note that it is all in English.

Today's problem: Titanic
• Titanic: Machine Learning from Disaster
• https://www.kaggle.com/c/titanic/
• Data can be downloaded under Home -> Data
• Submissions are made under Home -> Make a submission

Let's take a look at the data
• Download
• train.csv
• test.csv
• The test file is missing one column.
• The task is to predict that column and submit it!

Reading the files
• Take a quick look with table and summary (EDA?)
– Make it a habit at every step
• About 62% died…
 train <- read.csv("train.csv", stringsAsFactors=FALSE)
 test <- read.csv("test.csv", stringsAsFactors=FALSE)
 table(train$Survived)
 prop.table(table(train$Survived))
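As a small illustration of what table and prop.table return, here is a sketch on a made-up vector. The name toy_survived and its values are assumptions for the example, not the Kaggle data.

```r
# Toy outcome vector (made-up values, not the real train.csv)
toy_survived <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0)

counts <- table(toy_survived)   # number of rows per value
shares <- prop.table(counts)    # the same counts expressed as proportions

print(counts)
print(shares)
```

The proportions always sum to 1, which makes them easier to compare across data sets of different sizes.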

1. Let's make a prediction
• The Titanic was a terrible disaster.
• So for a start, assume everyone on board died.
• Let's submit that.
 test$Survived <- rep(0, 418)
 submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
 write.csv(submit, file = "theyallperish.csv", row.names = FALSE)
• Submit via Home -> Make a submission
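The submission layout can be checked without the real data. The sketch below uses a hypothetical five-row stand-in for test.csv (toy_test and its IDs are made up) and writes to a temporary file rather than the working directory.

```r
# Toy stand-in for test.csv (hypothetical passenger IDs)
toy_test <- data.frame(PassengerId = 892:896)

# Predict that nobody survives; nrow() avoids hard-coding the row count
toy_test$Survived <- rep(0, nrow(toy_test))

submit <- data.frame(PassengerId = toy_test$PassengerId,
                     Survived = toy_test$Survived)

# Write to a temporary file and read it back to check the layout
out_file <- tempfile(fileext = ".csv")
write.csv(submit, file = out_file, row.names = FALSE)
check <- read.csv(out_file)
print(names(check))
```

Using row.names = FALSE matters here: without it write.csv adds an extra unnamed column of row numbers to the file.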

Whoa
• Now you can go and bluff to people who don't know any better.
• You are ranked around 6,000th in the world.

2. Finding features by hand
• Ladies first? Was that really the case?
• It was…
 table(train$Sex)   # Sex was read as character, so table() gives the counts
 prop.table(table(train$Sex, train$Survived))
 prop.table(table(train$Sex, train$Survived), 1)
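The difference between the two prop.table calls is the margin argument. A minimal sketch on made-up data (the toy data frame is an assumption, not the real passengers):

```r
# Toy data: 3 women (2 survived) and 5 men (1 survived), made-up values
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Survived = c(1, 1, 0, 0, 0, 0, 1, 0)
)

tab <- table(toy$Sex, toy$Survived)

overall <- prop.table(tab)      # every cell divided by the grand total
by_sex  <- prop.table(tab, 1)   # every cell divided by its row total

print(overall)
print(by_sex)
```

With margin 1 each row sums to 1, so the "1" column can be read directly as the survival rate for each sex.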

2. Finding features by hand: submission
• Same procedure as the first submission
 test$Survived <- 0
 test$Survived[test$Sex == 'female'] <- 1
 submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
 write.csv(submit, file = "femaleSurvied.csv", row.names = FALSE)

2. Finding features by hand
• What about age?
 summary(train$Age)
 train$Child <- 0
 train$Child[train$Age < 18] <- 1
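One way to use the new Child flag is to compare survival rates per group with aggregate. This is a sketch on made-up rows (toy and its values are assumptions); note that rows with a missing age keep Child = 0.

```r
# Toy passengers (made-up values); one age is missing
toy <- data.frame(
  Age      = c(4, 30, NA, 16, 45, 10),
  Sex      = c("male", "female", "male", "female", "male", "female"),
  Survived = c(1, 1, 0, 1, 0, 0)
)

toy$Child <- 0
toy$Child[toy$Age < 18] <- 1   # rows where Age is NA are left at 0

# Mean of a 0/1 column is the survival rate within each Child/Sex group
rates <- aggregate(Survived ~ Child + Sex, data = toy, FUN = mean)
print(rates)
```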

3. Decision tree
• A greedy decision tree builder (rpart)
library(rpart)
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
             data=train,
             method="class")
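To see fitting and predicting end to end without the Kaggle files, here is a sketch on generated data where survival depends only on sex. The toy data frame and its rule are assumptions made for the example.

```r
library(rpart)

# Toy data (made up): 200 passengers, survival determined entirely by Sex
toy <- data.frame(
  Sex    = factor(rep(c("female", "male"), times = 100)),
  Pclass = rep(c(1, 2, 3, 1), times = 50)
)
toy$Survived <- ifelse(toy$Sex == "female", 1, 0)

# method = "class" asks for a classification tree
fit <- rpart(Survived ~ Sex + Pclass, data = toy, method = "class")

# type = "class" returns the predicted label instead of class probabilities
pred <- predict(fit, toy, type = "class")
print(table(predicted = pred, actual = toy$Survived))
```

On the real data the same predict call would be made with test as the second argument to produce the Survived column for submission.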

4. Feature Engineering
• Instead of relying only on the existing variables, let's create meaningful variables ourselves.
• Could analyzing the names reveal something?

• First, names contain titles such as Sir or Ms, which hint at whether a passenger was nobility.
• Let's split the names to work out social status!

• Combine the two tables.
– We are going to create new features,
– and doing the work in a single table gives train and test the same factor levels.
> test$Survived <- NA
> combi <- rbind(train, test)
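Why the combined table helps with factors can be shown on a tiny example. The data frames below are hypothetical; the point is that a factor built on the combined rows has the same levels in both halves, even for a value that appears only in the test rows.

```r
# Toy train/test (made-up rows); "Dr" occurs only in the test part
toy_train <- data.frame(PassengerId = 1:3,
                        Survived = c(0, 1, 1),
                        Title = c("Mr", "Mrs", "Miss"),
                        stringsAsFactors = FALSE)
toy_test <- data.frame(PassengerId = 4:5,
                       Title = c("Mr", "Dr"),
                       stringsAsFactors = FALSE)

toy_test$Survived <- NA                 # rbind needs the same columns in both
combi <- rbind(toy_train, toy_test)     # columns are matched by name

combi$Title <- factor(combi$Title)      # build the factor once, on all rows

train_part <- combi[1:3, ]
test_part  <- combi[4:5, ]
print(levels(train_part$Title))
print(levels(test_part$Title))
```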

• How should we extract it?
• Let's practice on a single entry.
> combi$Name <- as.character(combi$Name)
> combi$Name[1]
[1] "Braund, Mr. Owen Harris"
> strsplit(combi$Name[1], split='[,.]')
[[1]]
[1] "Braund" " Mr" " Owen Harris"
> strsplit(combi$Name[1], split='[,.]')[[1]]
[1] "Braund" " Mr" " Owen Harris"
> strsplit(combi$Name[1], split='[,.]')[[1]][2]
[1] " Mr"

• Apply it to the whole table to create a new feature
 combi$Title <- sapply(combi$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
 combi$Title <- sub(' ', '', combi$Title)
 table(combi$Title)
        Capt          Col          Don         Dona           Dr     Jonkheer         Lady
           1            4            1            1            8            1            1
       Major       Master         Miss         Mlle          Mme           Mr          Mrs
           2           61          260            2            1          757          197
          Ms          Rev          Sir the Countess
           2            8            1            1
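The same extraction can be tried on a few names without loading the data. This sketch assumes names in the "Surname, Title. Given names" format and uses trimws instead of sub to drop the leading space; toy_names is a made-up vector for the example.

```r
# A few names in the "Surname, Title. Given names" format
toy_names <- c("Braund, Mr. Owen Harris",
               "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
               "Heikkinen, Miss. Laina")

# Split every name on commas and periods; strsplit is already vectorised
parts <- strsplit(toy_names, split = "[,.]")

# The title is the second piece; trimws removes the surrounding spaces
titles <- trimws(sapply(parts, function(p) p[2]))
surnames <- sapply(parts, function(p) p[1])

print(titles)
print(surnames)
```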

• Look closely at the factor levels and merge those with the same meaning.
 combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
 combi$Title[combi$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
 combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'
 combi$Title <- factor(combi$Title)

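• Wouldn't a family share the same fate?
– Survive together or die together…
• Let's use the names to work out which passengers belonged to the same family!
– Matching on surname should do.
– We need to think about how to handle common surnames.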

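• First… a large family may actually have had a harder time surviving,
• so let's compute the family size first.
• Siblings + spouse + parents + children + the passenger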
 combi$FamilySize <- combi$SibSp + combi$Parch + 1

• Extract the family name and turn it into a factor
• Look closely and the counts don't add up… can you see a family that should have 3 members but has only 1?
 combi$Surname <- sapply(combi$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})
 combi$FamilyID <- paste(as.character(combi$FamilySize), combi$Surname, sep="")
 combi$FamilyID[combi$FamilySize <= 2] <- 'Small'
 table(combi$FamilyID)
   11Sage   3Abbott 3Appleton 3Beckwith   3Boulos
       11         3         1         2         3
  3Bourke    3Brown 3Caldwell  3Christy  3Collyer
        3         4         3         2         3
 3Compton  3Cornell   3Coutts   3Crosby   3Danbom
        3         1         3         3         3
. . .

• It looks like splitting off the passengers travelling in groups of 1 or 2 introduced some errors.
• Some families' counts don't match, either.
• A different approach

• Count the family IDs and store the counts in a separate data frame
 famIDs <- data.frame(table(combi$FamilyID))

• Keep only the IDs that occur 2 times or fewer,
• then relabel every passenger in the original data whose ID is on that list as 'Small'.
 famIDs <- famIDs[famIDs$Freq <= 2,]
 combi$FamilyID[combi$FamilyID %in% famIDs$Var1] <- 'Small'
 combi$FamilyID <- factor(combi$FamilyID)
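The two-pass cleanup can be traced on a handful of made-up rows. In this sketch (toy and its surnames are assumptions) one passenger claims a family of 3 but is the only "3Brown" on board, so the first pass keeps the ID and the frequency pass folds it into 'Small'.

```r
# Toy rows (made up): three Sages travelling together, one stray "3Brown"
toy <- data.frame(
  Surname    = c("Sage", "Sage", "Sage", "Brown", "Brown", "Brown", "Lee"),
  FamilySize = c(3, 3, 3, 3, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Pass 1: build the ID and collapse families of size 2 or less
toy$FamilyID <- paste(toy$FamilySize, toy$Surname, sep = "")
toy$FamilyID[toy$FamilySize <= 2] <- "Small"

# Pass 2: collapse IDs that actually occur 2 times or fewer
famIDs <- data.frame(table(toy$FamilyID))
famIDs <- famIDs[famIDs$Freq <= 2, ]
toy$FamilyID[toy$FamilyID %in% famIDs$Var1] <- "Small"
toy$FamilyID <- factor(toy$FamilyID)

print(table(toy$FamilyID))
```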

• Split back into train and test,
• then build a decision tree
 train <- combi[1:891,]
 test <- combi[892:1309,]
 fit <- rpart(
Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title
+ FamilySize + FamilyID,
data=train,
method="class")

• Titles such as Ms, Mr, and Sir are more precise than sex alone

Ta-da

5. Random Forest
• What is a random forest?
• First, what is an ensemble?
– It combines several learning models.

5. Random Forest
• Random Forest
– From randomly drawn samples,
– build many decision trees,
– and combine each tree's prediction into the final prediction.
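The three steps above can be sketched by hand with rpart. Note that this is plain bagging (bootstrap samples plus a majority vote), not a full random forest: a real random forest also considers only a random subset of the features at each split, which the randomForest package handles internally. The toy data and its survival rule are made up.

```r
library(rpart)
set.seed(1)

# Toy data (made up): women and first-class men survive
n <- 200
toy <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = sample(1:3, n, replace = TRUE)
)
toy$Survived <- factor(ifelse(toy$Sex == "female" | toy$Pclass == 1, 1, 0))

n_trees <- 25
votes <- sapply(seq_len(n_trees), function(i) {
  idx  <- sample(n, n, replace = TRUE)   # step 1: bootstrap sample of rows
  tree <- rpart(Survived ~ Sex + Pclass, data = toy[idx, ], method = "class")  # step 2: one tree
  as.character(predict(tree, toy, type = "class"))
})

# Step 3: majority vote across the trees for every passenger
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
accuracy <- mean(bagged == as.character(toy$Survived))
print(accuracy)
```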

5. Random Forest
• Random Forest cannot handle missing values…
• Find the missing values in Age and fill them in
 Agefit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked
+ Title + FamilySize,
data=combi[!is.na(combi$Age),], method="anova")
 combi$Age[is.na(combi$Age)] <- predict(Agefit, combi[is.na(combi$Age),])
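The same fill-in pattern, shown on generated data so it runs without the Kaggle files. The toy data frame, its Title values, and the age rule are assumptions for the example; only rows with a known age are used to fit, and only rows with a missing age are overwritten.

```r
library(rpart)
set.seed(2)

# Toy data (made up): "Master" passengers are young, the others are adults
n <- 120
toy <- data.frame(Title = factor(sample(c("Master", "Mr", "Miss"), n, replace = TRUE)))
toy$Age <- ifelse(toy$Title == "Master", 5, 30) + rnorm(n)
toy$Age[sample(n, 15)] <- NA            # knock out 15 ages

known <- !is.na(toy$Age)

# method = "anova" fits a regression tree, since Age is numeric
agefit <- rpart(Age ~ Title, data = toy[known, ], method = "anova")

# Overwrite only the missing ages with the tree's predictions
toy$Age[!known] <- predict(agefit, toy[!known, ])
print(summary(toy$Age))
```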

5. Random Forest
• Let's check for other missing values
• Fill them in
– Think about the meaning and fill in a value that makes sense!
> summary(combi)
> table(combi$Embarked)
      C   Q   S
  2 270 123 914
> which(combi$Embarked == '')
[1] 62 830
> combi$Embarked[c(62,830)] = "S"
> combi$Embarked <- factor(combi$Embarked)

5. Random Forest
• Fill this one in too…
> summary(combi$Fare)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   7.896  14.450  33.300  31.280 512.300       1
> which(is.na(combi$Fare))
[1] 1044
> combi$Fare[1044] <- median(combi$Fare, na.rm=TRUE)
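Both simple fills (most common value for a blank category, median for a missing number) can be written without hard-coded row numbers. A sketch on made-up rows; toy and its values are assumptions for the example:

```r
# Toy rows (made up): one blank port and one missing fare
toy <- data.frame(
  Embarked = c("S", "C", "", "S", "Q", "S"),
  Fare     = c(7.25, 71.28, 8.05, NA, 8.46, 13.00),
  stringsAsFactors = FALSE
)

# Blank category: replace with the most common non-blank value
blank <- which(toy$Embarked == "")
most_common <- names(which.max(table(toy$Embarked[toy$Embarked != ""])))
toy$Embarked[blank] <- most_common
toy$Embarked <- factor(toy$Embarked)

# Missing number: replace with the median of the known values
toy$Fare[is.na(toy$Fare)] <- median(toy$Fare, na.rm = TRUE)

print(toy)
```

Looking rows up with which() and is.na() keeps the code valid if the data are reordered, which fixed indices such as 62 or 1044 would not.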

5. Random Forest
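• randomForest limits how many levels a factor can have, so build FamilyID2 with a higher 'Small' cutoff and turn it into a factor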
> combi$FamilyID2 <- combi$FamilyID
> combi$FamilyID2 <- as.character(combi$FamilyID2)
> combi$FamilyID2[combi$FamilySize <= 3] <- 'Small'
> combi$FamilyID2 <- factor(combi$FamilyID2)

5. Random Forest
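• Install the randomForest library and run it!!
> install.packages('randomForest')
> library(randomForest)
> combi$Sex <- factor(combi$Sex)   # Sex was read as character; use a factor
> train <- combi[1:891,]           # re-split so the filled-in values and FamilyID2 are included
> test <- combi[892:1309,]
> set.seed(415)
> fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp
                      + Parch + Fare + Embarked + Title + FamilySize + FamilyID2,
                      data=train,
                      importance=TRUE,
                      ntree=2000)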

5. Random Forest
• Let's check the importance of each feature
> varImpPlot(fit)

5. Random Forest
• Run the prediction! Hyah… huh?
• The score got worse
> Prediction <- predict(fit, test)
> submit <- data.frame(PassengerId = test$PassengerId, Survived =
Prediction)
> write.csv(submit, file = "firstforest.csv", row.names = FALSE)

5. Random Forest
• There are several kinds of random forest.
• Let's try them all
• Let’s try a forest of conditional inference trees.
> install.packages('party')
> library(party)

5. Random Forest
• Hyah!! Please work!!
> set.seed(415)
> fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare +
Embarked + Title + FamilySize + FamilyID,
data = train,
controls=cforest_unbiased(ntree=2000, mtry=3))
> Prediction <- predict(fit, test, OOB=TRUE, type = "response")

Reference links
• Main
– http://trevorstephens.com/kaggle-titanic-tutorial/getting-started-with-r/
• https://www.kaggle.com/c/titanic/details/getting-started-with-random-forests
• Follow the links above from one to the next… and look for more material.
