Preview of the Data
We first want to preview all the data we have, so that we can establish
a working plan. We are going to load all the tables and explore the
structure of each one. The .csv files are imported into the R workspace
with read.csv(). We also load readr and some other libraries that will
be used during the process.
#We change our system preferences to English for outputs of weekdays, months, etc.
Sys.setlocale("LC_TIME", "English")
[1] "English_United States.1252"
library(readr)
library(tidyverse)
library(janitor)
library(lubridate)
library(DataExplorer)
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
library(VennDiagram)
Loading required package: grid
Loading required package: futile.logger
The following code imports all the .csv files from a specific
directory into R.
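#The pattern is a regular expression, so we escape the dot and anchor it at the end of the name
filename <- list.files(path = "Fitabase Data 4.12.16-5.12.16/", pattern = "\\.csv$")
#seq_along() is safe even if no file is found
for (i in seq_along(filename))
  assign(filename[i], read.csv(paste0("Fitabase Data 4.12.16-5.12.16/", filename[i])))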
list.files(path="Fitabase Data 4.12.16-5.12.16/", pattern="*.csv")
[1] "dailyActivity_merged.csv" "dailyCalories_merged.csv" "dailyIntensities_merged.csv" "dailySteps_merged.csv"
[5] "heartrate_seconds_merged.csv" "hourlyCalories_merged.csv" "hourlyIntensities_merged.csv" "hourlySteps_merged.csv"
[9] "minuteCaloriesNarrow_merged.csv" "minuteCaloriesWide_merged.csv" "minuteIntensitiesNarrow_merged.csv" "minuteIntensitiesWide_merged.csv"
[13] "minuteMETsNarrow_merged.csv" "minuteSleep_merged.csv" "minuteStepsNarrow_merged.csv" "minuteStepsWide_merged.csv"
[17] "sleepDay_merged.csv" "weightLogInfo_merged.csv"
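As an alternative to creating one global object per file with assign(), the tables can be kept in a single named list. The sketch below is only an illustration: it writes two tiny CSV files to a temporary folder, and the folder name, file names and values are invented for the example, not taken from the Fitabase data.

```r
# Hypothetical demo folder with two tiny CSV files (not the Fitabase data).
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Id = c(1, 2), Calories = c(1800, 2100)),
          file.path(demo_dir, "dailyCalories_demo.csv"), row.names = FALSE)
write.csv(data.frame(Id = c(1, 2), StepTotal = c(9000, 12000)),
          file.path(demo_dir, "dailySteps_demo.csv"), row.names = FALSE)

# List the files; the pattern is a regular expression, so the dot is escaped.
csv_paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into one named list instead of separate global objects.
tables <- lapply(csv_paths, read.csv)
names(tables) <- basename(csv_paths)

# Each table is then reached by its file name.
str(tables[["dailyCalories_demo.csv"]])
```

With this layout, the same check can be applied to every table at once, for example sapply(tables, nrow).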
dailyActivity_merged.csv preview
The first exploration will be on the
dailyActivity_merged file. We want to explore the
structure of the data. As we can see, this table has 15 variables and a
total of 940 observations, where each observation corresponds to one day
for a specific user (Id). All the columns are numeric except
ActivityDate, which is of type chr.
str(dailyActivity_merged.csv)
'data.frame': 940 obs. of 15 variables:
$ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
$ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
$ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
$ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
$ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
$ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
$ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
$ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
$ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
$ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
$ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
$ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
$ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
$ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
$ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
dailyCalories_merged.csv preview
This table has 3 variables and a total of 940 observations, where
each observation corresponds to one day for a specific user (Id) and the
total calories burned that day. So the tables with 940 observations
appear to be per-day summaries for each user, and they can all be
joined to “dailyActivity_merged.csv”.
str(dailyCalories_merged.csv)
'data.frame': 940 obs. of 3 variables:
$ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
$ ActivityDay: chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
$ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
heartrate_seconds_merged.csv preview
This table has a lot of observations, as you can see (2483658). The
first thing we observe is that the first Id shown differs from the one
in the other tables. So we group by Id to count the distinct Ids and
compare with our previous data. We find that some Ids are missing
relative to the other tables: here we have only 14 versus 33 in the
other tables. This may be because the missing individuals do not have
this function active on their device, for example due to configuration
issues.
str(heartrate_seconds_merged.csv)
'data.frame': 2483658 obs. of 3 variables:
$ Id : num 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
$ Time : chr "4/12/2016 7:21:00 AM" "4/12/2016 7:21:05 AM" "4/12/2016 7:21:10 AM" "4/12/2016 7:21:20 AM" ...
$ Value: int 97 102 105 103 101 95 91 93 94 93 ...
heartrate_seconds_merged.csv %>%
group_by(Id) %>%
summarise(count = n())%>%
nrow()
[1] 14
dailyActivity_merged.csv %>%
group_by(Id) %>%
summarise(count = n())%>%
nrow()
[1] 33
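Counting distinct Ids shows how many users are missing, but not which ones. A minimal base R sketch of that second step follows; the Id values are invented for the illustration and stand in for the Id columns of the two tables.

```r
# Toy Id vectors standing in for the Id columns of two tables (invented values).
activity_ids  <- c(101, 101, 102, 103, 103, 104)
heartrate_ids <- c(101, 101, 101, 103)

# Number of distinct users in each table.
n_activity  <- length(unique(activity_ids))
n_heartrate <- length(unique(heartrate_ids))

# Users present in the activity table but absent from the heart-rate table.
missing_ids <- setdiff(unique(activity_ids), unique(heartrate_ids))

n_activity
n_heartrate
missing_ids
```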
hourlyCalories_merged.csv preview
Now we want to understand the data inside the files with 22099
observations. This file contains the calories for each user in
intervals of 1 hour for each day. So the sum of the hourly data for
each day should be equal to the value in the daily file. We are going
to validate this for the calories, and as we can see there are some
discrepancies between the total calories in
dailyActivity_merged.csv and
hourlyCalories_merged.csv.
str(hourlyCalories_merged.csv)
'data.frame': 22099 obs. of 3 variables:
$ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
$ ActivityHour: chr "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
$ Calories : int 81 61 59 47 48 48 48 47 68 141 ...
verification_hourlyCalories <- hourlyCalories_merged.csv %>%
group_by(Id) %>%
arrange(Id) %>%
summarise(sum = sum(Calories))
verification_dailyActivity <- dailyActivity_merged.csv %>%
group_by(Id) %>%
arrange(Id) %>%
summarise(sum = sum(Calories))
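#We match the two summaries by Id instead of relying on the row order
verification_dailyActivity %>%
  inner_join(verification_hourlyCalories, by = "Id", suffix = c("_daily", "_hourly")) %>%
  mutate(difference = sum_daily - sum_hourly) %>%
  head()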
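The comparison above sums the calories per user over the whole period. A finer check would compare the totals per user and per day. The sketch below shows one way to do this in base R on invented values (one user, two days); it is an illustration of the idea, not a result from the Fitabase files.

```r
# Toy hourly and daily calorie tables (invented values, one user, two days).
hourly <- data.frame(
  Id = 1,
  ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM",
                   "4/13/2016 12:00:00 AM", "4/13/2016 1:00:00 AM"),
  Calories = c(80, 60, 70, 75)
)
daily <- data.frame(
  Id = 1,
  ActivityDate = c("4/12/2016", "4/13/2016"),
  Calories = c(140, 150)
)

# Reduce both tables to a common (Id, date) key.
hourly$date <- as.Date(hourly$ActivityHour, format = "%m/%d/%Y")
daily$date  <- as.Date(daily$ActivityDate, format = "%m/%d/%Y")

# Sum the hourly values per user and day.
hourly_sum <- aggregate(Calories ~ Id + date, data = hourly, FUN = sum)

# Match rows by key rather than by position, then compute the gap.
check <- merge(daily[, c("Id", "date", "Calories")], hourly_sum,
               by = c("Id", "date"), suffixes = c("_daily", "_hourly"))
check$difference <- check$Calories_daily - check$Calories_hourly
check
```

Rows with a non-zero difference point to the days where the two files disagree.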
minuteCaloriesNarrow_merged.csv preview
We want to explore where the inconsistency starts. So we explore the
minute-level file and compare it with the hourly file. We see that
there are discrepancies here as well.
verification_minuteCaloriesNarrow <- minuteCaloriesNarrow_merged.csv %>%
group_by(Id) %>%
arrange(Id) %>%
summarise(sum = sum(Calories))
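#Again we match the two summaries by Id instead of relying on the row order
verification_hourlyCalories %>%
  inner_join(verification_minuteCaloriesNarrow, by = "Id", suffix = c("_hourly", "_minute")) %>%
  mutate(difference = sum_minute - sum_hourly) %>%
  head()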
sleepDay_merged.csv and weightLogInfo_merged.csv preview
Finally we are going to explore the last two tables,
sleepDay_merged.csv and weightLogInfo_merged.csv. We
can observe that the sleepDay table has information for each day for
each user Id; however, it has far fewer rows than the
dailyActivity_merged.csv file. We investigate this
discrepancy first through the number of users, and we see that some
users do not appear here.
str(sleepDay_merged.csv)
'data.frame': 413 obs. of 5 variables:
$ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
$ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
$ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
$ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
$ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
sleepDay_merged.csv %>%
group_by(Id)%>%
arrange(Id)%>%
summarise(count = n())%>%
nrow()
[1] 24
For weightLogInfo_merged.csv, we observe that only 8
users appear in the report.
str(weightLogInfo_merged.csv)
'data.frame': 67 obs. of 8 variables:
$ Id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
$ Date : chr "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
$ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
$ WeightPounds : num 116 116 294 125 126 ...
$ Fat : int 22 NA NA NA NA 25 NA NA NA NA ...
$ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
$ IsManualReport: chr "True" "True" "False" "True" ...
$ LogId : num 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
unique(weightLogInfo_merged.csv[c("Id")])%>%
nrow()
[1] 8
No key information was found on participant demographics, age,
gender, or weather indicators. Unfortunately, this, together with the
small sample size, limits the scope of the analysis that can be
performed.
Cleaning and Formatting Data-sets
We are going to combine all the data from
dailyActivity_merged.csv, sleepDay_merged.csv and
weightLogInfo_merged.csv in a single data set. First we are
going to clean and format the data sets. We are going to make all
the variable names lowercase with the function clean_names() from
the janitor library. We are also going to rename every date variable
to date in all tables, and finally convert the dates to the
year-month-day format.
dailyActivity_clean <- dailyActivity_merged.csv %>%
clean_names() %>%
rename(date = activity_date)%>%
mutate(date = as.Date(date, format = "%m/%d/%Y"))
sleepDay_clean <- sleepDay_merged.csv %>%
clean_names() %>%
rename(date = sleep_day)%>%
mutate(date = as.Date(date, format = "%m/%d/%Y"))
weightLogInfo_clean <- weightLogInfo_merged.csv %>%
rename(date = Date)%>%
clean_names() %>%
mutate(date = as.Date(date, format = "%m/%d/%Y"))
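The same format string is used above for columns with and without a time of day. The short base R sketch below illustrates why that works, using two sample strings in the layouts seen in the str() output: as.Date() reads the month/day/year part and ignores the rest of the string.

```r
# Sample strings in the two layouts seen in the str() output.
date_only      <- "4/12/2016"
date_with_time <- "4/12/2016 12:00:00 AM"

# as.Date() reads month/day/year and ignores the trailing time of day.
d1 <- as.Date(date_only, format = "%m/%d/%Y")
d2 <- as.Date(date_with_time, format = "%m/%d/%Y")

format(d1)         # the date printed as year-month-day text
identical(d1, d2)  # both strings map to the same calendar day
```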
Now we are going to prepare the data for a merge/join between tables,
so we need to check the data for duplicate rows and missing (NA) values.
sum(duplicated(dailyActivity_clean))
[1] 0
sum(is.na (dailyActivity_clean))
[1] 0
sum(duplicated(sleepDay_clean))
[1] 3
sum(is.na (sleepDay_clean))
[1] 0
sum(duplicated(weightLogInfo_clean))
[1] 0
sum(is.na (weightLogInfo_clean))
[1] 65
So we have found that sleepDay_clean has duplicate rows and that there
are NA values in weightLogInfo_clean. However, the NA values are all in
one column (fat), so we are only going to remove the duplicates.
sleepDay_clean <- sleepDay_clean %>%
distinct()
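The duplicate check and its fix can be illustrated on a small invented table. This base R sketch uses unique(), which plays the same role for data frames as distinct() does in the pipeline above; the values are made up for the example.

```r
# Toy sleep table with one exact duplicate row (invented values).
sleep_demo <- data.frame(
  id = c(1, 1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-12")),
  total_minutes_asleep = c(327, 327, 384, 412)
)

# Count fully duplicated rows and missing values, as in the checks above.
n_dup <- sum(duplicated(sleep_demo))
n_na  <- sum(is.na(sleep_demo))

# Keep one copy of each row.
sleep_unique <- unique(sleep_demo)

n_dup
n_na
nrow(sleep_unique)
```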
Finally we are going to merge all the data into one data frame and
change the type of id from numeric to character, so that each user is
treated as a category.
dailyActivity_join <- dailyActivity_clean %>%
left_join(sleepDay_clean, by = c("id", "date")) %>%
left_join(., weightLogInfo_clean, by = c("id", "date"))
#now we change the data type for the id column
dailyActivity_join$id <- as.character(dailyActivity_join$id)
head(dailyActivity_join)
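A left join keeps every row of the left table and fills the columns of the right table with NA where no match exists. The base R sketch below shows this behaviour on invented values, using merge(..., all.x = TRUE) as a stand-in for left_join().

```r
# Toy activity table (left) and sleep table (right); values are invented.
activity_demo <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_steps = c(13162, 10735, 4000)
)
sleep_demo <- data.frame(
  id = 1,
  date = as.Date("2016-04-12"),
  total_minutes_asleep = 327
)

# all.x = TRUE keeps every activity row, matched or not.
joined <- merge(activity_demo, sleep_demo, by = c("id", "date"), all.x = TRUE)

nrow(joined) == nrow(activity_demo)      # no activity row is lost
sum(is.na(joined$total_minutes_asleep))  # rows without a sleep record
joined
```

Comparing the row count before and after the join is a quick way to notice duplicated keys in the right table, since those would multiply rows.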
We are also going to use the data in
hourlyCalories_merged.csv,
hourlyIntensities_merged.csv and
hourlySteps_merged.csv. We are just going to check them for
duplicates.
sum(duplicated(hourlyCalories_merged.csv))
[1] 0
sum(duplicated(hourlyIntensities_merged.csv))
[1] 0
sum(duplicated(hourlySteps_merged.csv))
[1] 0
Now we are going to format the hours and also clean the names.
#Each table parses its own date_time column, so the result does not depend on the three files having the same row order
hourlyCalories_clean <- hourlyCalories_merged.csv %>%
  clean_names() %>%
  rename(date_time = activity_hour) %>%
  mutate(date_time = mdy_hms(date_time))
hourlyIntensities_clean <- hourlyIntensities_merged.csv %>%
  clean_names() %>%
  rename(date_time = activity_hour) %>%
  mutate(date_time = mdy_hms(date_time))
hourlySteps_clean <- hourlySteps_merged.csv %>%
  clean_names() %>%
  rename(date_time = activity_hour) %>%
  mutate(date_time = mdy_hms(date_time))
Since we did not find any duplicates, we are going to merge all the
data into a single data frame, hourlyActivity_join.
hourlyActivity_join <- hourlyCalories_clean %>%
inner_join(hourlyIntensities_clean, by = c("id", "date_time"))%>%
inner_join(.,hourlySteps_clean, by = c("id", "date_time"))
#We are also going to separate the date from the hour to make the data easier to handle
hourlyActivity_join <- hourlyActivity_join %>%
separate(date_time, into = c("date", "time"), sep= " ")%>%
#and we are going to change the format of the hour to show only hour and minute
mutate(time = format(parse_date_time(as.character(time), "HMS"), format = "%H:%M"))
#now we change the data type of the id column
hourlyActivity_join$id <- as.character(hourlyActivity_join$id)
head(hourlyActivity_join)
Since we have already merged our main data frames, we can drop all the
other objects from the R environment (for performance and cleanliness).
#First we list all the objects we have, to get an overview
ls()
[1] "avg_income_year" "dailyActivity_clean" "dailyActivity_join" "dailyActivity_merged.csv"
[5] "dailyCalories_merged.csv" "dailyIntensities_merged.csv" "dailySteps_merged.csv" "filename"
[9] "filterdata" "gss" "heartrate_seconds_merged.csv" "hourlyActivity_join"
[13] "hourlyCalories_clean" "hourlyCalories_merged.csv" "hourlyIntensities_clean" "hourlyIntensities_merged.csv"
[17] "hourlySteps_clean" "hourlySteps_merged.csv" "i" "minuteCaloriesNarrow_merged.csv"
[21] "minuteCaloriesWide_merged.csv" "minuteIntensitiesNarrow_merged.csv" "minuteIntensitiesWide_merged.csv" "minuteMETsNarrow_merged.csv"
[25] "minuteSleep_merged.csv" "minuteStepsNarrow_merged.csv" "minuteStepsWide_merged.csv" "sleepDay_clean"
[29] "sleepDay_merged.csv" "verification_dailyActivity" "verification_hourlyCalories" "verification_minuteCaloriesNarrow"
[33] "weightLogInfo_clean" "weightLogInfo_merged.csv"
#Now we drop all objects except the data frames we created and will use later.
rm(list=setdiff(ls(), c("dailyActivity_join", 'hourlyActivity_join', 'dailyActivity_clean', 'sleepDay_clean', 'weightLogInfo_clean', 'heartrate_seconds_merged.csv')))
Finally, we are not going to use all the columns in
dailyActivity_join, so we can drop some of them (for
performance and cleanliness).
dailyActivity_join <- dailyActivity_join %>%
select(-c(total_distance,
tracker_distance,
logged_activities_distance,
very_active_distance,
moderately_active_distance,
light_active_distance,
sedentary_active_distance,
total_sleep_records,
total_time_in_bed,
weight_kg,
weight_pounds,
fat,
bmi,
is_manual_report,
log_id))
Normality analysis of data frames
Here we are going to investigate the normality of the numerical data,
to learn more about the limitations of our data. Let's start with the
variables inside dailyActivity_join:
#Here we use the library DataExplorer, since our data frame has some categorical variables and it would be difficult to write a loop for ggplot2.
dailyActivity_join %>%
plot_histogram(
ncol = 3,
ggtheme = theme_light()
)

We can see that some variables are close to normally distributed, with
little skew and few abnormal values, e.g. calories,
total_minutes_asleep and lightly_active_minutes, while others have
strongly right-skewed distributions, e.g. fairly_active_minutes
and very_active_minutes.
Now we analyze the data inside hourlyActivity_join:
hourlyActivity_join %>%
plot_histogram(
ncol = 3,
ggtheme = theme_light()
)

Here we can see that all variables are right skewed. This is related
to the fact that during most hours people are working or sleeping, and
since the intensity is low in those hours, a skewed plot for the
calories is to be expected.
Data analysis
Distribution of the tracking of the devices
We are ready to ask some questions of our data. The first question
we want to investigate is:
- What is the distribution of device usage across the
different activities?
We already know the answer to this question thanks to the initial
exploration we did. We have 33 users who use their device to track
their daily activity, 24 users who track their sleep behavior, 8 users
who track their weight loss/gain and 14 users who track their heart
rate. So let's put this information on a plot.
#We are going to plot a Venn diagram for the 4 data frames dailyActivity_clean, sleepDay_clean, weightLogInfo_clean and heartrate_seconds_merged.csv
#First we need to create the sets: for each data frame, a set of unique Ids.
step_ids <- unique(dailyActivity_clean$id, incomparables = FALSE)
sleep_ids <- unique(sleepDay_clean$id, incomparables = FALSE)
heartrate_ids <- unique(heartrate_seconds_merged.csv$Id, incomparables = FALSE)
weight_ids <- unique(weightLogInfo_clean$id, incomparables = FALSE)
#Now we create the graph. First we need a named list of the sets.
x <- list(A=step_ids, B=sleep_ids, C=heartrate_ids, D=weight_ids)
#Function to display the Venn diagram inside markdown; for this we need the library VennDiagram
display_venn <- function(x, ...){
grid.newpage()
venn_object <- venn.diagram(x, filename = NULL, ...)
grid.draw(venn_object)
}
#display Venn diagram
display_venn(
x,
category.names = c("Steps count", "Sleep monitor", "Heart monitor", "Weight tracking"),
fill = c("#999999", "#E69F00", "#56B4E9", "#009E73")
)
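The counts behind a Venn diagram like this one can also be computed directly with set operations. The sketch below uses invented Id sets (not the real users) to show how the size of each set and of selected overlaps can be obtained in base R.

```r
# Toy Id sets standing in for the four tracking features (invented values).
sets <- list(
  steps  = c(1, 2, 3, 4, 5),
  sleep  = c(1, 2, 3),
  heart  = c(2, 3, 4),
  weight = c(3)
)

# Size of each set.
set_sizes <- sapply(sets, length)

# Users that appear in all four sets.
all_four <- Reduce(intersect, sets)

# Users that track both sleep and heart rate.
sleep_and_heart <- intersect(sets$sleep, sets$heart)

set_sizes
all_four
sleep_and_heart
```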

Type of users per activity level
Here we will ascertain how often the participants use their smart
devices. With dailyActivity_join, we will assume that days with 200 or
fewer total steps are days on which users did not use their watches. We
will filter out these inactive days and assign the following
designations:
- Low Use - 1 to 5 days
- Moderate Use - 6 to 20 days
- High Use - 21 to 31 days
Breaking down the analysis in this way will help us
understand the different trends underlying each usage group.
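The three designations can be expressed as intervals on the number of active days. As a minimal illustration with invented day counts, base R's cut() assigns the same labels as a chain of ifelse() calls with the thresholds 5, 20 and 31.

```r
# Invented counts of active days for six hypothetical users.
days_used <- c(3, 5, 6, 20, 21, 31)

# Intervals are closed on the right: (0,5], (5,20], (20,31].
usage <- cut(days_used,
             breaks = c(0, 5, 20, 31),
             labels = c("Low use", "Moderate use", "High use"))

data.frame(days_used, usage)
table(usage)
```

cut() returns a factor whose levels already follow the order of the labels, which is convenient for plotting.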
#Here we create a table to classify the users according to the times they appear in the data frame
dailyActivity_join %>%
filter(total_steps > 200) %>%
group_by(id) %>%
summarize(count = n()) %>%
mutate(usage = ifelse(count <= 5, "Low use",
ifelse(count <= 20, "Moderate use",
ifelse(count <= 31, "High Use", NA))))%>%
#We now create a table of percentages to better visualize the results in the graph.
#The :: here calls the function percent from the library scales; since we only use it once, we don't need to load the library.
group_by(usage) %>%
summarise(total = n()) %>%
mutate(perc = total/sum(total))%>%
mutate(perc = scales::percent(perc)) %>%
#Now that we have our new table we can create our plot.
ggplot(aes(x = "", y = total, fill = usage )) +
geom_bar(stat='identity', width = 1) +
coord_polar("y", start=0)+
theme_void()+
theme(plot.title = element_text(hjust = 0.5, vjust= -5, size = 20, face = "bold")) +
geom_text(aes(label = perc, x = 1.25),position = position_stack(vjust = 0.5)) +
labs(title = "Usage Group Distribution") +
guides(fill = guide_legend(title = "Usage Type"))

Analyzing our results, we can see that 63.6% of the users in our
sample use their device frequently, almost every day (between 25 and 31
days), and 27.3% use their device 15 to 25 days. 6.1% of our sample use
their device between 5 and 15 days, and 3.0% use their devices very
rarely.
Time used smart device and distribution
We will analyse the steps taken by users within and between groups
per day and per hour. Let's start with the daily steps for each user
between groups.
#Here we create a new column in our data frame with the classification we did before, since we are going to need it for the rest of the analysis.
dailyActivity_join <- dailyActivity_join %>%
  filter(total_steps > 200) %>%
  group_by(id) %>%
  mutate(count = n()) %>%
  mutate(usage = ifelse(count <= 5, "Low use",
                 ifelse(count <= 20, "Moderate use",
                 ifelse(count <= 31, "High use", NA)))) %>%
  #We set the factor levels in the order we want them to appear on the plots.
  mutate(usage = factor(usage, levels = c('Low use','Moderate use','High use'))) %>%
  #Since we grouped our main data frame by id, we need to ungroup it; otherwise later summarize() calls would stay grouped by id.
  ungroup()
dailyActivity_join %>%
ggplot(aes(x = date, y = total_steps, group = id, color = id)) +
geom_line() +
theme(legend.position = "none")+
facet_wrap(~usage, ncol = 1)

There is no specific trend here, since some high use users have
some days with low total steps. Now we are going to plot the average
steps per day for each group.
dailyActivity_join %>%
group_by(usage, date) %>%
summarize(average_steps = mean(total_steps)) %>%
ggplot(aes(x = date, y = average_steps, fill = usage, color = usage)) +
geom_col()+
facet_wrap(~usage)
`summarise()` has grouped output by 'usage'. You can override using the `.groups` argument.

Now we are going to visualize this in a better way through a boxplot
diagram.
dailyActivity_join %>%
group_by(usage, date) %>%
summarize(average_steps = mean(total_steps)) %>%
ggplot(aes(x = usage, y = average_steps, fill = usage, color = usage)) +
geom_boxplot()
`summarise()` has grouped output by 'usage'. You can override using the `.groups` argument.

Finally, we are going to plot the average use of the devices (average
steps) per weekday.
#First we create a column that contains the weekday
dailyActivity_join %>%
  mutate(weekday = weekdays(as.Date(date)),
         weekday = fct_relevel(weekday, c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))) %>%
  #Now we group by usage and weekday, get the average and the half-width of the 95% confidence interval (t quantile with n - 1 degrees of freedom), and finally we plot.
  group_by(weekday, usage) %>%
  summarize(average_steps = mean(total_steps), ci = qt(0.975, df = n() - 1)*sd(total_steps)/sqrt(n()))%>%
  ggplot(aes(x = weekday, y = average_steps, fill = usage, color = usage)) +
  geom_col()+
  #Uncomment the next line to add the confidence intervals
  #geom_errorbar(aes(ymin = average_steps - ci, ymax = average_steps + ci), width = 0.2, colour = 'black') +
  facet_wrap(~usage, ncol=1)
`summarise()` has grouped output by 'weekday'. You can override using the `.groups` argument.
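The ci column in the code above is the half-width of a t-based confidence interval for the mean. The sketch below works through that calculation on five invented step counts and cross-checks it against t.test(); it assumes the usual one-sample interval with n - 1 degrees of freedom.

```r
# Invented daily step counts for one hypothetical group.
steps_demo <- c(8000, 9500, 7200, 10100, 8800)

n          <- length(steps_demo)
mean_steps <- mean(steps_demo)
std_error  <- sd(steps_demo) / sqrt(n)

# Half-width of the 95% interval: t quantile with n - 1 degrees of freedom.
half_width <- qt(0.975, df = n - 1) * std_error
ci <- c(lower = mean_steps - half_width, upper = mean_steps + half_width)

# Cross-check against the interval reported by t.test().
tt <- t.test(steps_demo)$conf.int

ci
tt
```

As the text notes later, repeated measurements from the same users are not independent, so such intervals are only a rough guide here.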

We can see some patterns in our data:
- Average steps per day increase as usage of the devices increases; we
are going to investigate this further in the next section.
- For moderate and high use users, there is no clear day that
shows a higher mean than the other days (a t-test would be needed to
confirm this; however, be aware that the data are not independent
within groups or between groups).
- Low use users (1 individual) do not seem to display any
difference in the mean compared with the moderate use users.
Usage during the day (a more in-depth analysis)
Now that we have some trends of usage, we want to see the distribution
of device usage during the day, and how this correlates with some
activities. For this we are going to work with the
hourlyActivity_join table. The first thing we are going to
investigate is the distribution of device usage during each day of
the week for each group.
#Since this is a different data frame, we need to make the classification again. First we sum the steps per user and calendar date, so that we can filter by values > 200 in the next step.
#We group by the full date rather than by day(date): if the data span more than one month, the day of the month alone would merge different dates.
hourlyActivity_join <- hourlyActivity_join %>%
  group_by(id, date) %>%
  mutate(total_steps = sum(step_total)) %>%
  ungroup()
#Now we count the days on which each user used the device and make the categorization.
hourlyActivity_join <- hourlyActivity_join %>%
  filter(total_steps > 200) %>%
  group_by(id) %>%
  mutate(days_usage = n_distinct(date)) %>%
  mutate(usage = ifelse(days_usage <= 5, "Low use",
                 ifelse(days_usage <= 20, "Moderate use",
                 ifelse(days_usage <= 31, "High use", NA)))) %>%
  #We set the factor levels in the order we want them to appear on the plots.
  mutate(usage = factor(usage, levels = c('Low use','Moderate use','High use'))) %>%
  ungroup()
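A note on counting days of use: when the recording window spans two calendar months, the day of the month repeats, so counting distinct calendar dates and counting distinct day-of-month values can give different results. The sketch below illustrates this, assuming a window from 4/12/2016 to 5/12/2016 as the folder name suggests.

```r
# Assumed recording window, taken from the folder name (4.12.16-5.12.16).
dates_demo <- seq(as.Date("2016-04-12"), as.Date("2016-05-12"), by = "day")

# Distinct calendar dates in the window.
n_dates <- length(unique(dates_demo))

# Distinct day-of-month values: April 12 and May 12 collapse into one.
n_day_of_month <- length(unique(format(dates_demo, "%d")))

n_dates
n_day_of_month
```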
#Now we plot
hourlyActivity_join %>%
mutate(weekday = format(ymd(date), format = '%a'),
weekday = fct_relevel(weekday, c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))) %>%
group_by(time, weekday, usage) %>%
summarize(average_steps = mean(step_total)) %>%
ggplot(aes(x = time, y = average_steps, fill = average_steps)) +
viridis::scale_fill_viridis(option = "D")+
geom_col()+
facet_grid(usage~weekday)+
theme(axis.text.x = element_text(size = 5, angle = 90))
`summarise()` has grouped output by 'time', 'weekday'. You can override using the `.groups` argument.

#We are also going to make a heat map of the same distribution, to have another option for presentation.
hourlyActivity_join %>%
mutate(weekday = format(ymd(date), format = '%a'),
weekday = fct_relevel(weekday, c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))) %>%
group_by(weekday,time, usage) %>%
summarize(average_steps = mean(step_total)) %>%
ggplot(aes(x = time, y = weekday, fill = average_steps)) +
viridis::scale_fill_viridis(option = "D")+
geom_tile()+
geom_text(aes(label = round(average_steps, digits = 0)), color = "black", size = 2.0) +
facet_wrap(~usage, ncol=1)+
theme(axis.text.x = element_text(size = 5, angle = 90))
`summarise()` has grouped output by 'weekday', 'time'. You can override using the `.groups` argument.

We can see some patterns in our data:
- The high use users start their day an hour earlier (6:00 AM)
compared to the other groups and end their day an hour later (10:00 PM).
During the weekdays the peaks are between 5:00 and 8:00 PM, suggesting
habitual exercise as work ends.
- Moderate use users display peaks in their steps on Saturdays and
Sundays, between 8:00 AM and 12:00 PM.
More specific questions about the data
As we saw in the last part, there are some hours where the users have
peaks. We want to investigate whether this is related to exercise
sessions (we can go to the gym and just lift weights, or we can spend
some time doing cardio on a treadmill). For this we are going to
investigate the intensity variable, and we want to answer a question:
- What is the relation between intensity and average
steps?
#First we are going to plot the distribution of intensity across the days.
hourlyActivity_join %>%
mutate(weekday = format(ymd(date), format = '%a'),
weekday = fct_relevel(weekday, c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))) %>%
group_by(time, weekday, usage) %>%
summarize(average_intensity_hour = mean(average_intensity))%>%
ggplot(aes(x = time, y = average_intensity_hour, fill = average_intensity_hour)) +
viridis::scale_fill_viridis(option = "inferno")+
geom_col()+
facet_grid(usage~weekday)+
theme(axis.text.x = element_text(size = 5, angle = 90))
`summarise()` has grouped output by 'time', 'weekday'. You can override using the `.groups` argument.

We can see that the plots of average_intensity and average_steps per
group are very similar. So we expect a linear correlation
between these two variables.
hourlyActivity_join %>%
mutate(weekday = weekdays(as.Date(date))) %>%
group_by(time, weekday, usage) %>%
summarize(average_intensity_hour = mean(average_intensity), average_steps = mean(step_total) )%>%
ggplot(aes(x = average_intensity_hour, y = average_steps)) +
geom_point()+
geom_smooth()+
facet_wrap(~usage)
`summarise()` has grouped output by 'time', 'weekday'. You can override using the `.groups` argument.

So, as we expected, we have a positive (almost linear) correlation
between the average intensity per hour and the average steps per hour.
So we can associate the high step counts with exercise sessions in
which the user is very active. Let's also investigate the correlation
between these variables and the average calories burned.
hourlyActivity_join %>%
mutate(weekday = weekdays(as.Date(date))) %>%
group_by(time, weekday, usage) %>%
summarize(average_intensity_hour = mean(average_intensity), average_steps = mean(step_total), average_calories = mean(calories)) %>%
GGally::ggpairs(columns = c(4,5,6))
`summarise()` has grouped output by 'time', 'weekday'. You can override using the `.groups` argument.

We can see a strong correlation between these variables. This is
expected since, as we saw before, high step counts are generally
associated with high-intensity exercise sessions, in which the user
tends to burn more calories.
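The correlations shown in the pairs plot can also be read as a plain matrix with cor(). The sketch below uses six invented hourly averages (not values from the data) with the same three column names, only to show the shape of the output.

```r
# Invented hourly averages for illustration only.
hourly_demo <- data.frame(
  average_intensity = c(0.1, 0.3, 0.5, 0.9, 1.4, 2.0),
  step_total        = c(50, 180, 300, 650, 900, 1500),
  calories          = c(60, 75, 90, 120, 150, 210)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(hourly_demo)
round(cor_matrix, 2)
```

A high correlation here describes a linear association only; it does not by itself show that the steps come from exercise sessions.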
What about sleep behavior?
Another variable that will be interesting to analyze is the sleep
behavior. We want to investigate how the sleep behavior of users
varies with their activity level.
- What is the relation between activity level and sleep
hours?
#First we are going to plot the distribution of sleep across the groups.
dailyActivity_join %>%
group_by(date, usage) %>%
summarize(average_sleep_minutes = mean(total_minutes_asleep, na.rm=TRUE)) %>%
ggplot(aes(x = usage, y = average_sleep_minutes, fill = usage)) +
geom_boxplot()
`summarise()` has grouped output by 'date'. You can override using the `.groups` argument.

Here we see that the users have almost the same mean, regardless of
their usage group. The missing value for the low use group is due to
the fact that the only user we have in this group does not have any
data about their sleep behavior. We are going to go more in depth by
making a classification for the time slept:
- Bad sleep - slept 300 minutes or less
- Normal Sleep - slept more than 300 and up to 480 minutes
- Over Sleep - slept more than 480 minutes
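The three sleep categories are intervals on the minutes asleep, with cut points at 300 and 480. As a minimal sketch on invented values, cut() reproduces the labels and leaves missing sleep records as NA.

```r
# Invented minutes asleep, including boundary values and a missing record.
minutes_asleep <- c(250, 300, 301, 480, 481, NA)

# Intervals are closed on the right: up to 300, (300,480], above 480.
sleep_type <- cut(minutes_asleep,
                  breaks = c(-Inf, 300, 480, Inf),
                  labels = c("Bad sleep", "Normal sleep", "Over sleep"))

data.frame(minutes_asleep, sleep_type)
```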
dailyActivity_join <- dailyActivity_join %>%
mutate(sleep_type = ifelse(total_minutes_asleep<= 300, "Bad sleep",
ifelse(total_minutes_asleep <= 480, "Normal sleep",
ifelse(total_minutes_asleep > 480, "Over sleep", NA))),
sleep_type = factor(sleep_type, level = c('Bad sleep','Normal sleep','Over sleep')))
dailyActivity_join %>%
group_by(sleep_type, id) %>%
summarize(count_sleep = n()) %>%
drop_na() %>%
summarize(total_sleep_type = n()) %>%
mutate(perc = total_sleep_type/sum(total_sleep_type))%>%
mutate(perc = scales::percent(perc)) %>%
ggplot(aes(x = "", y = total_sleep_type, fill = sleep_type)) +
geom_bar(stat='identity', width = 1) +
coord_polar("y", start=0)+
theme_void()+
theme(plot.title = element_text(hjust = 0.5, vjust= -5, size = 20, face = "bold")) +
geom_text(aes(label = perc, x = 1.2),position = position_stack(vjust = 0.5)) +
labs(title = "Sleep Type Distribution") +
guides(fill = guide_legend(title = "sleep Type"))
`summarise()` has grouped output by 'sleep_type'. You can override using the `.groups` argument.

#We can also visualize this distribution across the different usage groups.
dailyActivity_join %>%
group_by(usage, sleep_type, id) %>%
summarize(count_sleep = n()) %>%
drop_na() %>%
summarize(total_sleep_type = n()) %>%
mutate(perc = total_sleep_type/sum(total_sleep_type))%>%
mutate(perc = scales::percent(perc)) %>%
ggplot(aes(x = "", y = total_sleep_type, fill = sleep_type)) +
geom_bar(stat='identity', width = 1, position = "fill") +
coord_polar("y", start=0)+
theme_void()+
theme(plot.title = element_text(hjust = 0.5, vjust= 5, size = 20, face = "bold")) +
geom_text(aes(label = perc, x=1.2), position = position_fill(vjust = 0.5)) +
labs(title = "Sleep Type Distribution") +
guides(fill = guide_legend(title = "Sleep Type"))+
facet_wrap(~usage, strip.position = "bottom")
`summarise()` has grouped output by 'usage', 'sleep_type'. You can override using the `.groups` argument.`summarise()` has grouped output by 'usage'. You can override using the `.groups` argument.

Analyzing our results, we can see that 26.4% of the time users
report a bad sleep, 35.8% of the time they have a normal sleep and
35.8% of the time they oversleep. Across the groups we can see that
the distribution is similar to the global one. Note that a user can
have days in each category.
Finally, we are going to relate the sleep behavior to the activity
level, and we are going to classify each observation according to its
activity level. This classification is based on the activity minutes
compared with the average over all users, e.g. if
the sedentary minutes are greater than the global sedentary
average, the observation is classified as sedentary. Finally, if an
observation does not fall into any category, we exclude it from the data.
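The classification rule can be sketched on a small invented table. The example below is a simplified base R illustration, not the exact pipeline: the helper mean_nonzero() is a hypothetical name, and the values are made up so that each of the four rows lands in a different category.

```r
# Invented activity minutes for four hypothetical observations.
demo <- data.frame(
  sedentary_minutes      = c(1200, 700, 800, 600),
  lightly_active_minutes = c(50, 300, 100, 150),
  fairly_active_minutes  = c(0, 5, 40, 10),
  very_active_minutes    = c(0, 0, 5, 60)
)

# Hypothetical helper: mean of the non-zero values only.
mean_nonzero <- function(x) mean(x[x > 0])
avg <- sapply(demo, mean_nonzero)

# Rules are checked in order, from least to most active; NA if none applies.
active_type <- with(demo,
  ifelse(sedentary_minutes > avg[["sedentary_minutes"]] &
         lightly_active_minutes < avg[["lightly_active_minutes"]] &
         fairly_active_minutes < avg[["fairly_active_minutes"]] &
         very_active_minutes < avg[["very_active_minutes"]], "Sedentary",
  ifelse(lightly_active_minutes > avg[["lightly_active_minutes"]] &
         fairly_active_minutes < avg[["fairly_active_minutes"]] &
         very_active_minutes < avg[["very_active_minutes"]], "Lightly Active",
  ifelse(fairly_active_minutes > avg[["fairly_active_minutes"]] &
         very_active_minutes < avg[["very_active_minutes"]], "Fairly Active",
  ifelse(very_active_minutes > avg[["very_active_minutes"]], "Very Active",
         NA_character_)))))

cbind(demo, active_type)
```

Because every comparison is strict, a value exactly equal to the average matches no rule, which is one way an observation can end up unclassified.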
#We need to classify the activity level. First we get the average of each activity variable over all users.
#Zero values are treated as missing (NA) and ignored in the mean calculation; the original columns are left unchanged.
#na_if() is applied column by column, because recent dplyr versions no longer accept a whole data frame in na_if().
temp <- dailyActivity_join %>%
  mutate(sedentary_minutes_avg = mean(na_if(sedentary_minutes, 0), na.rm = TRUE),
         lightly_active_minutes_avg = mean(na_if(lightly_active_minutes, 0), na.rm = TRUE),
         fairly_active_minutes_avg = mean(na_if(fairly_active_minutes, 0), na.rm = TRUE),
         very_active_minutes_avg = mean(na_if(very_active_minutes, 0), na.rm = TRUE)) %>%
  #Now we make the classification using case_when
mutate(active_type = factor(case_when(sedentary_minutes > sedentary_minutes_avg &
lightly_active_minutes < lightly_active_minutes_avg &
fairly_active_minutes< fairly_active_minutes_avg &
very_active_minutes < very_active_minutes_avg ~ "Sedentary",
lightly_active_minutes > lightly_active_minutes_avg &
fairly_active_minutes < fairly_active_minutes_avg &
very_active_minutes < very_active_minutes_avg ~ "Lightly Active",
fairly_active_minutes > fairly_active_minutes_avg &
very_active_minutes < very_active_minutes_avg ~ 'Fairly Active',
very_active_minutes > very_active_minutes_avg ~ 'Very Active'),
levels=c("Sedentary", "Lightly Active", "Fairly Active", "Very Active")))%>%
drop_na(sleep_type, active_type)
#finally we plot.
temp %>%
ggplot(aes(x = active_type, fill = sleep_type)) +
geom_bar(position = "fill") +
labs(y = "Proportion")

temp %>%
ggplot(aes(x = active_type, fill = sleep_type)) +
geom_bar(position = "fill") +
labs(y = "Proportion")+
facet_wrap(~usage)

Analyzing our results, we can see that sedentary people tend to have
bad sleep behavior. We can also observe that a little activity during
the day tends to go with normal sleep. Also, as the activity level
increases, oversleeping decreases.
