Alejandra Plazola and MAP5667
Dog popularity, intelligence, and breed info.
Recently, I have been into dogs and raising my pups, I did quite a bit of background research on dog breeds before selecting which breed was the best at fitting into my lifestyle plus my past experiences of raising certain dog breeds like German Shepherds, Labs, and Rottweilers.
This got me to thinking that I automatically wanted to do my project one over dog breed intelligence and comparing it to dog breed popularity.I wanted to compare dogs intelligence and see if intelligence is a type of trait that people are interested in when picking a dog and it's overall popularity for that breed.
I have chosen 3 data sets "AKC_Popularity_Breeds_2013-2019", "AKC Breed Info" and "dog_intelligence". These datasets were obtained through dataworld:https://data.world/len/akc-popular-dog-breeds, https://data.world/len/dog-canine-breed-size-akc, and https://data.world/len/intelligence-of-dogs. (For some reason data world doesn't connect through html here but it does if you open the link in a new tab.)
For the "dog_intelligence" dataset it contains 5 different variables, Breed = dog breed, Classification = what type of dog the breed is categorized, obey = the probability that the breed obeys the first command, reps_lower= lower limit of repetitions the breed takes to understand new commands, and reps_upper= upper limit of repetitions the breed takes to understand new commands. This dataset has 136 observations in total.
For the "AKC_Popular_Breeds_2013-2019" dataset, it has 8 variables and 193 observations. The 8 different variables are Breed= dog breed,and then popularity ranking for each year from 2013-2019 (Rank 2013, Rank 2014, Rank 2015, Rank 2016, Rank 2017, Rank 2018, and Rank 2019). I also updated this dataset from it's original on dataworld to include more recent rankings.
The new rankings sources I added from 2017-19 used was from: https://www.akc.org/expert-advice/news/most-popular-dog-breeds-full-ranking-list/ and https://www.akc.org/expert-advice/dog-breeds/2020-popular-breeds-2019/. There was a ranking from 2020, but it was a predicted one and since the year is not over I did not want to include it.
To reiterate these are interesting because I see that some people get certain dog breeds because their family or friends have them, sometimes they see it off a movie/tv show, or even know the type of dog they want such as a working breed, herding dog type, guard dogs, etc. I wanted to see if intelligence plays a big role in popularity of dogs as I know that most of the dogs I've had growing up were due to the stigma around the breed, such as they are smart or amazing guard dogs. Over the summer my family got two new puppies, a Beglian Malinois, and a Great Pyrenees/German Shepherd mix. They were picked because they are known to be good guard dogs/working dogs, considered very bright, and that was something my father wanted for the home, also they are good family companions. Another factor that people use to choose animals is their size, many people like bigger breeds and others like smaller.
I want to see if there is a correlation between popularity of dog breeds between it's intelligence rank, and it's size. I predict people will choose dog breeds that are the smartest.
library(tidyverse)
Dogp <- read_csv("/stor/home/map5667/Website/content/project/AKC_Popular_Breeds_2013-2019.csv")
Dogint <- read_csv("/stor/home/map5667/Website/content/project/dog_intelligence.csv")
Dogbreed <- read_csv("/stor/home/map5667/Website/content/project/AKC Breed Info.csv")
# tidying only Dog popularity ranks
Dogpp <- Dogp %>% rename(`2019` = "2019 Rank", `2018` = "2018 Rank",
`2017` = "2017 Rank", `2016` = "2016 Rank", `2015` = "2015 Rank",
`2014` = "2014 Rank", `2013` = "2013 Rank")
Dogpop <- Dogpp %>% pivot_longer(cols = c("2013", "2014", "2015",
"2016", "2017", "2018", "2019"), names_to = "Year", values_to = "Rank") %>%
na.omit
Dogpop %>% glimpse()
## Rows: 1,309
## Columns: 3
## $ Breed <chr> "American Water Spaniel", "American Water Spaniel", "American W…
## $ Year <chr> "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2013",…
## $ Rank <dbl> 144, 160, 157, 156, 167, 166, 164, 143, 144, 136, 149, 147, 148…
Dogpop %>% summarize_all(n_distinct)
## # A tibble: 1 x 3
## Breed Year Rank
## <int> <int> <int>
## 1 193 7 193
I first renamed all year ranks and pivoted longer into a seperate column by year and rank for the dog popularity dataset and omitted any n/a rows from the set
Dogintel <- Dogint %>% rename(Lower = "reps_lower", Upper = "reps_upper") %>%
pivot_longer(cols = c("Lower", "Upper"), names_to = "Reps",
values_to = "Repnum")
Dogin <- Dogintel %>% group_by(Breed) %>% mutate(avgReps = mean(Repnum,
na.rm = T)) %>% select(-Repnum, -Reps) %>% distinct()
Dogin %>% glimpse()
## Rows: 136
## Columns: 4
## Groups: Breed [136]
## $ Breed <chr> "Affenpinscher", "Afghan Hound", "Airedale Terrier", "…
## $ Classification <chr> "Above Average Working Dogs", "Lowest Degree of Workin…
## $ obey <chr> "70%", NA, "70%", "50%", "50%", "50%", "70%", "50%", "…
## $ avgReps <dbl> 20.5, 90.5, 20.5, 33.0, 33.0, 33.0, 20.5, 33.0, 2.5, 3…
Dogin %>% head()
## # A tibble: 6 x 4
## # Groups: Breed [6]
## Breed Classification obey avgReps
## <chr> <chr> <chr> <dbl>
## 1 Affenpinscher Above Average Working Dogs 70% 20.5
## 2 Afghan Hound Lowest Degree of Working/Obedience Intelligen… <NA> 90.5
## 3 Airedale Terrier Above Average Working Dogs 70% 20.5
## 4 Akita Average Working/Obedience Intelligence 50% 33
## 5 Alaskan Malamute Average Working/Obedience Intelligence 50% 33
## 6 American Foxhound Average Working/Obedience Intelligence 50% 33
Dogbd <- Dogbreed %>% rename(Low.ht = "height_low_inches", High.ht = "height_high_inches",
Low.wt = "weight_low_lbs", High.wt = "weight_high_lbs") %>%
pivot_longer(cols = c("Low.ht", "High.ht"), names_to = "Ht",
values_to = "Ht.inch") %>% pivot_longer(cols = c("Low.wt",
"High.wt"), names_to = "Wt", values_to = "Wt.lbs")
Dogbdinfo <- Dogbd %>% na.omit() %>% group_by(Breed) %>% mutate(avght = mean(Ht.inch,
na.rm = T), avgwt = mean(Wt.lbs, na.rm = T)) %>% select(-Ht.inch,
-Ht, -Wt.lbs, -Wt) %>% distinct()
Dogbdinfo %>% glimpse()
## Rows: 148
## Columns: 3
## Groups: Breed [148]
## $ Breed <chr> "Affenpinscher", "Afghan Hound", "Airdale Terrier", "Akita", "A…
## $ avght <dbl> 10.5, 26.0, 23.0, 27.0, 14.0, 23.5, 18.0, 16.5, 28.0, 18.5, 20.…
## $ avgwt <dbl> 10.0, 55.0, 45.0, 100.0, 27.5, 67.5, 45.0, 35.0, 125.0, 40.0, 5…
Dogbdinfo %>% head()
## # A tibble: 6 x 3
## # Groups: Breed [6]
## Breed avght avgwt
## <chr> <dbl> <dbl>
## 1 Affenpinscher 10.5 10
## 2 Afghan Hound 26 55
## 3 Airdale Terrier 23 45
## 4 Akita 27 100
## 5 American Eskimo 14 27.5
## 6 American Foxhound 23.5 67.5
I did renamed the lower and higher reps for the dog intelliegence dataset and seperated them into categories into their own column and values into their own. Then, I averaged all reps to get rid of double variables and have an avg of both high/low reps. Next, I renamed the height and weight columns, pivot longer both, and then averaged height and weight into their own columns so it looks tidier.
Dogpopnew <- Dogpop %>% group_by(Breed) %>% mutate(AvgRank = mean(Rank,
na.rm = T)) %>% select(-Rank, -Year) %>% distinct()
Dogs <- Dogpopnew %>% full_join(Dogin, by = c(Breed = "Breed"))
Doggos <- Dogs %>% full_join(Dogbdinfo, by = c(Breed = "Breed"))
DognoNA <- Doggos %>% na.omit()
DognoNA %>% glimpse()
## Rows: 102
## Columns: 7
## Groups: Breed [102]
## $ Breed <chr> "American Water Spaniel", "Affenpinscher", "Akita", "A…
## $ AvgRank <dbl> 159.14286, 145.28571, 46.28571, 184.00000, 81.57143, 5…
## $ Classification <chr> "Average Working/Obedience Intelligence", "Above Avera…
## $ obey <chr> "50%", "70%", "50%", "50%", "70%", "95%", "50%", "70%"…
## $ avgReps <dbl> 33.0, 20.5, 33.0, 33.0, 20.5, 2.5, 33.0, 20.5, 20.5, 3…
## $ avght <dbl> 16.5, 10.5, 27.0, 23.5, 18.0, 18.5, 20.5, 10.0, 21.0, …
## $ avgwt <dbl> 35.0, 10.0, 100.0, 67.5, 45.0, 40.0, 50.0, 12.0, 50.0,…
DognoNA %>% head()
## # A tibble: 6 x 7
## # Groups: Breed [6]
## Breed AvgRank Classification obey avgReps avght avgwt
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 American Water Sp… 159. Average Working/Obedienc… 50% 33 16.5 35
## 2 Affenpinscher 145. Above Average Working Do… 70% 20.5 10.5 10
## 3 Akita 46.3 Average Working/Obedienc… 50% 33 27 100
## 4 American Foxhound 184 Average Working/Obedienc… 50% 33 23.5 67.5
## 5 American Stafford… 81.6 Above Average Working Do… 70% 20.5 18 45
## 6 Australian Cattle… 55.6 Brightest Dogs 95% 2.5 18.5 40
First, I had to mutate my ranking for all years for my Dogpop dataset to make it tidyier and make sure each breed only comes out once (Breed= AvgRank). I then full joined my new dataset for dog popularity with Dog intelligence data, then full joined this new combined data (Doggos) with my dog breed information dataset (Dogbd). I dropped all n/a in all datasets and once I joined them I dropped them again if they didn't have any correlation between the 3 datasets to get rid of dog breeds that have no information in the other datasets.
I chose to full join because I wanted to keep all variables(AvgRank, Classification, Obey%, Height (low/high), Weight (low/high), and reps (low/high)) in each dataset after tidying in a new dataset by combining it to the common column 'Breed'.
DogRH <- DognoNA %>% group_by(Breed) %>% arrange(-desc(AvgRank)) %>%
mutate(Avg_Reps_per_ht = avgReps/avght)
DogRH %>% head()
## # A tibble: 6 x 8
## # Groups: Breed [6]
## Breed AvgRank Classification obey avgReps avght avgwt Avg_Reps_per_ht
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Labrador… 1 Brightest Dogs 95% 2.5 22.5 67.5 0.111
## 2 German S… 2 Brightest Dogs 95% 2.5 24 82.5 0.104
## 3 Golden R… 3 Brightest Dogs 95% 2.5 22.5 65 0.111
## 4 French B… 6.29 Fair Working/Obed… 30% 60.5 11.5 22.5 5.26
## 5 Yorkshir… 8.43 Above Average Wor… 70% 20.5 8 5 2.56
## 6 Rottweil… 8.57 Brightest Dogs 95% 2.5 24.5 100 0.102
DogRW <- DognoNA %>% group_by(Breed) %>% arrange(-desc(AvgRank)) %>%
mutate(Avg_Reps_per_wt = avgReps/avgwt)
DogRW %>% head()
## # A tibble: 6 x 8
## # Groups: Breed [6]
## Breed AvgRank Classification obey avgReps avght avgwt Avg_Reps_per_wt
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Labrador… 1 Brightest Dogs 95% 2.5 22.5 67.5 0.0370
## 2 German S… 2 Brightest Dogs 95% 2.5 24 82.5 0.0303
## 3 Golden R… 3 Brightest Dogs 95% 2.5 22.5 65 0.0385
## 4 French B… 6.29 Fair Working/Obed… 30% 60.5 11.5 22.5 2.69
## 5 Yorkshir… 8.43 Above Average Wor… 70% 20.5 8 5 4.1
## 6 Rottweil… 8.57 Brightest Dogs 95% 2.5 24.5 100 0.025
# Top 5 most popular breeds based off of rank in the Average
# Working/Obedience Intelligence
DogObey50top <- DognoNA %>% filter(Classification == "Average Working/Obedience Intelligence") %>%
select(AvgRank, avgwt, avght) %>% arrange(-desc(AvgRank))
DogObey50top %>% head(5)
## # A tibble: 5 x 4
## # Groups: Breed [5]
## Breed AvgRank avgwt avght
## <chr> <dbl> <dbl> <dbl>
## 1 Boxer 10.1 67.5 23
## 2 Dachshund 11.9 24 8.5
## 3 Siberian Husky 13.1 50 21.5
## 4 Great Dane 15.3 140 32
## 5 Australian Shepherd 16.6 50 20.5
# Bottom 5 most popular breeds based off of rank in the
# Average Working/Obedience Intelligence
DogObey50bottom <- DognoNA %>% filter(Classification == "Average Working/Obedience Intelligence") %>%
select(AvgRank, avgwt, avght) %>% arrange(desc(AvgRank))
DogObey50bottom %>% head(5)
## # A tibble: 5 x 4
## # Groups: Breed [5]
## Breed AvgRank avgwt avght
## <chr> <dbl> <dbl> <dbl>
## 1 English Foxhound 186. 67.5 23.5
## 2 American Foxhound 184 67.5 23.5
## 3 Finnish Spitz 178. 33 17.5
## 4 Curly Coated Retriever 162. 72.5 26
## 5 American Water Spaniel 159. 35 16.5
Dogcor <- DognoNA %>% group_by(Classification) %>% summarize(mean_rank = mean(AvgRank),
sd_rank = sd(AvgRank), variance = var(AvgRank, y = NULL,
na.rm = FALSE), count_rank = n(), quantile = qnorm(p = 0.5,
mean = mean_rank, sd = sd_rank), min_rank = min(AvgRank),
max_rank = max(AvgRank), distinct_rank = n_distinct(AvgRank),
cor1 = cor(avght, AvgRank), cor2 = cor(avgwt, AvgRank))
Dogcor %>% head()
## # A tibble: 5 x 11
## Classification mean_rank sd_rank variance count_rank quantile min_rank
## <chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 Above Average… 99.1 41.7 1736. 25 99.1 8.43
## 2 Average Worki… 90.9 59.9 3588. 32 90.9 10.1
## 3 Brightest Dogs 21.7 21.0 442. 9 21.7 1
## 4 Excellent Wor… 58.4 41.5 1719. 17 58.4 10.7
## 5 Fair Working/… 83.7 52.2 2721. 19 83.7 6.29
## # … with 4 more variables: max_rank <dbl>, distinct_rank <int>, cor1 <dbl>,
## # cor2 <dbl>
DognoNA %>% group_by(Classification) %>% summarize(mean_rank = mean(AvgRank),
sd_rank = sd(AvgRank), min_rank = min(AvgRank), max_rank = max(AvgRank),
variance_rank = var(AvgRank, y = NULL, na.rm = FALSE))
## # A tibble: 5 x 6
## Classification mean_rank sd_rank min_rank max_rank variance_rank
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Above Average Working Dogs 99.1 41.7 8.43 167. 1736.
## 2 Average Working/Obedience I… 90.9 59.9 10.1 186. 3588.
## 3 Brightest Dogs 21.7 21.0 1 55.6 442.
## 4 Excellent Working Dogs 58.4 41.5 10.7 154. 1719.
## 5 Fair Working/Obedience Inte… 83.7 52.2 6.29 175. 2721.
'DogRH', 'DogRW', 'DogObey50top', and 'DogObey50bottom' all show the 6 dplyr functions I used except for summarize, but I did end up using it in 'DogCor'. In DogRH and DogRW, I mutated to find the proportion of average Reps and average height in DogRH, and average reps and average weight in DogRW. I filters my DognoNA into classification by 50% or the Average Working/Obedience Intelligence Dog Breeds, and arragned them by lowest ranking(top 5 most popular) and then highest ranking (top 5 least most popular). I used the summarize function in DogCor,the 1variance tables for avgRank, and grouped them by dog intelligence classification for each dog breed.
Based on my data produced, for each classification all summary stats were produced. The above average working dogs, the mean popularity was rank was 99.07, the sd was 41.67, variance was 1735.86, the total count of each breed is 25, the quantile was 99.07, the min popularity rank was 8.42, max popularity rank was 167.14, the 25 distant rankings, and both negative correlations for popularity ranking and weight, and popularity ranking and height.The average working/obedience intelligence, the mean popularity was rank was 90.87, the sd was 59.89, variance was 3587.86, the total count of each breed is 32, the quantile was 90.877, the min popularity rank was 10.14, max popularity rank was 185.85, the 31 distinct rankings, and both positive correlations for popularity ranking and weight, and popularity ranking and height.The Brightest Dogs, the mean popularity was rank was 21.73, the sd was 21.01, variance was 441.59, the total count of each breed is 9, the quantile was 21.73, the min popularity rank was 1, max popularity rank was 55.57, the 9 distanct rankings, and both negative correlations for popularity ranking and weight, and popularity ranking and height.The Excellent Working Dogs, the mean popularity was rank was 58.43, the sd was 41.46, variance was 1719.23, the total count of each breed is 17, the quantile was 58.43, the min popularity rank was 10.7, max popularity rank was 154.28, the 16 distanct rankings, and both negative correlations for popularity ranking and weight, and popularity ranking and height.The Fair Working/Obedience Intelligence, the mean popularity was rank was 83.69, the sd was 52.16, variance was 2721.25, the total count of each breed is 19, the quantile was 83.69, the min popularity rank was 6.28, max popularity rank was 175.42, the 19 distanct rankings, and both negative correlations for popularity ranking and weight, and popularity ranking and height. I did expect these correlations between avg ranking and average ht/wt, except for the average working/obedience intelligence because that had some positive correlations between their ranking and weight/height size. When looking at variances for average popularity ranking for all 5 intelligence classification, Brightest Dogs tended to have the least mean, sd, min, max, and variance as the overall popularity ranking for these dog breeds are lower (meaning they are overall more popular).
ggplot(DognoNA, aes(avgwt, AvgRank)) + geom_point(aes(color = Classification)) +
labs(x = "Average Weight of Dog Breeds (lbs)", y = "Average Popularity Rank from years 2013-19",
title = "Weight and Popularity Ranking of Intelligence Classification") +
labs(colour = "Classification") + scale_y_continuous(breaks = seq(0,
190, 25)) + scale_x_continuous(breaks = seq(0, 150, 25))
This graph shows a plot between average weight of dog breeds, average height of average popularity ranking, and 5 intelligence classification for all dog breeds. There is not much of a correlation in trends but you can see that above average working dogs tend to skew to a higher popularity ranking (meaning it is lower on a list of 1-193), as it's weight tends to be lower in lbs. You can't really see a relationship between weight and ranking. I changed the scaling on both the y and x axis and had all of them colored by classification.
ggplot(DognoNA, aes(x = Classification, y = AvgRank, fill = Classification)) +
geom_bar(stat = "summary", fun = mean) + scale_fill_brewer(palette = "Pastel2") +
scale_x_discrete(labels = c("Above Avg Wk", "Avg Wk/Ob",
"Brightest", "Excellent", "Fair Wk/Ob")) + labs(x = "Intelligence Classification of Dogs",
y = "Average Popularity Rank through 2013-19", title = "Average Popularity Ranking of Each Dog Intelligence Classification") +
scale_y_continuous(breaks = seq(0, 110, 10)) + geom_errorbar(stat = "summary",
width = 0.5) + theme(legend.text = element_text(size = 6))
This graphbar shows the number of average rankings per each dog intelligence classifications with a standard error bar. Just looking at the graph you can see that the brightest dogs have a overall smaller average ranking for popularity ranking meaning that it typically tends to be the most popular breeds that fall into this category and people tend to choose/own they dogs. The overall highest average ranking for popularity meaning it tends to be less popular dog breed are the above average working dogs, and people tend to choose/own these dogs less.I changed both x and y scales to make everything neater and more readable. I added color to each classification and changed the font size for the legend. I also changed the sizing of the SE bars.
library(tidyverse)
dgi <- Dogpop %>% full_join(Dogin, by = c(Breed = "Breed"))
dggg <- dgi %>% full_join(Dogbdinfo, by = c(Breed = "Breed"))
dg <- dggg %>% select_if(is.numeric) %>% na.omit()
dg %>% select_if(is.numeric) %>% cor(use = "pair")
## Rank avgReps avght avgwt
## Rank 1.000000000 0.07110275 -0.007939942 -0.11564849
## avgReps 0.071102750 1.00000000 -0.157220883 -0.01488321
## avght -0.007939942 -0.15722088 1.000000000 0.70747102
## avgwt -0.115648490 -0.01488321 0.707471015 1.00000000
cormat <- dg %>% select_if(is.numeric) %>% cor(use = "pair")
cordat <- cormat %>% as.data.frame %>% rownames_to_column("var1") %>%
pivot_longer(-1, names_to = "var2", values_to = "corelation")
cordat %>% ggplot(aes(var1, var2, fill = corelation)) + geom_tile() +
geom_text(aes(label = round(corelation, 2))) + xlab("") +
ylab("") + coord_fixed() + scale_fill_gradient2(low = "white",
mid = "pink", high = "red") + theme(axis.text.x = element_text(angle = 90,
vjust = 0.5, hjust = 1))
This heat map shows the correlation between 4 numeric variables in the 'dg' dataset. The dg dataset was using the untidyed popularity ranking per year as it made it easier to see the correlations than having it's average so I full joined that with the other tidied datasets and I will be using this one for dimensionality reduction as well. You can see that average height and weight do have a positve correlation meaning that they do affect each other in dog breeds, which is to be expected (taller dog/more it weighs and vice versa), there is a negative correlation between dog intelligence (avgReps) and height and weight (so they don't affect/correlate each other), there is also a negative correlation for height and weights between ranking (so no correlation for ranking and weight or ranking and height), and finally there is a small postive correlation for average reps (Dog intelligence) and ranking (so dog intelligence does affect popularity ranking).
library(cluster)
wss <- vector()
for (i in 1:4) {
dgt <- dg %>% dplyr::select(Rank, avgReps, avght, avgwt) %>%
kmeans(., i)
wss[i] <- dgt$tot.withinss
}
ggplot() + geom_point(aes(x = 1:4, y = wss)) + geom_path(aes(x = 1:4,
y = wss)) + xlab("clusters") + scale_x_continuous(breaks = 1:4)
kmeans1 <- dg %>% scale %>% kmeans(2)
kclust <- dg %>% mutate(cluster = as.factor(kmeans1$cluster))
kclust %>% ggplot(aes(avght, avgwt, color = cluster)) + geom_point()
kclust %>% mutate(reps = dg$avgReps) %>% ggplot(aes(avgwt, avght,
color = avgReps, shape = cluster)) + geom_point(size = 3) +
ggtitle("Cluster using kmeans")
So I first determined the number of clustered I need, and I determined I needed 2 based off the number of numeric variables I had. I used kmeans to cluster my data and create my graphs It's hard to tell but you can see that the lower reps () tend to be spread out between 0-100 average weight and be less than 35 inches of height. You can't really see any other correlations as it is all over the place so associations are hard to see.