Summary

Bellabeat is a high-tech company that manufactures health-focused smart products. They offer different smart devices that collect data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits.

The main focus of this case was to analyze fitness data from smart devices and determine how it could help unlock new growth opportunities for Bellabeat. We used data from the FitBit app to gain insight into how consumers use non-Bellabeat smart devices and to improve our product, the Leaf smart tracker.

The FitBit Fitness Tracker Data

The FitBit Fitness Tracker dataset is a public dataset (CC0: Public Domain) made available on Kaggle by the user Mobius.

This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016; the files used here cover 4.12.16 to 5.12.16, a one-month period. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.

The dataset is a collection of 18 .csv files: 15 in long format and 3 in wide format. It contains wide-ranging information, including activity metrics, calories, sleep records, metabolic equivalents of task (METs), heart rate and steps, at resolutions of seconds, minutes, hours and days. No metadata was provided.

Preview of the Data

We first want to preview all the data we have so that we can establish a working plan. We are going to load all the tables and explore the structure of each one. We load the readr R package, along with the other libraries used during the process; the import of the .csv data into the R workspace is done below with base R's read.csv().

#We change our system preferences to English for outputs of weekdays, months, etc.
Sys.setlocale("LC_TIME", "English")
[1] "English_United States.1252"
library(readr)
library(tidyverse)
library(janitor)
library(lubridate)
library(DataExplorer)
Registered S3 method overwritten by 'htmlwidgets':
  method           from         
  print.htmlwidget tools:rstudio
library(VennDiagram)
Loading required package: grid
Loading required package: futile.logger

The following code imports all the .csv files from a specific directory into R.

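# List the .csv files; "\\.csv$" is a regular expression matching names that end in .csv
filename <- list.files(path = "Fitabase Data 4.12.16-5.12.16/", pattern = "\\.csv$")

# Create one data frame per file, named after the file itself
for (i in seq_along(filename))
  assign(filename[i], read.csv(paste0("Fitabase Data 4.12.16-5.12.16/", filename[i])))

list.files(path = "Fitabase Data 4.12.16-5.12.16/", pattern = "\\.csv$")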
 [1] "dailyActivity_merged.csv"           "dailyCalories_merged.csv"           "dailyIntensities_merged.csv"        "dailySteps_merged.csv"             
 [5] "heartrate_seconds_merged.csv"       "hourlyCalories_merged.csv"          "hourlyIntensities_merged.csv"       "hourlySteps_merged.csv"            
 [9] "minuteCaloriesNarrow_merged.csv"    "minuteCaloriesWide_merged.csv"      "minuteIntensitiesNarrow_merged.csv" "minuteIntensitiesWide_merged.csv"  
[13] "minuteMETsNarrow_merged.csv"        "minuteSleep_merged.csv"             "minuteStepsNarrow_merged.csv"       "minuteStepsWide_merged.csv"        
[17] "sleepDay_merged.csv"                "weightLogInfo_merged.csv"          
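
As an alternative to creating one global object per file, the same import can be sketched with a named list. This is only an illustration under stated assumptions: the directory `demo_dir` and the two tiny files written into it are made up so that the example runs on its own, and the Id values are not real participants. Reading `Id` as text is a design choice that avoids the `1.5e+09` display seen in the str() output.

```r
# Hypothetical demo directory with two tiny files standing in for the real export
demo_dir <- file.path(tempdir(), "fitabase_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Id,ActivityDate,TotalSteps", "1000000001,4/12/2016,13162"),
           file.path(demo_dir, "dailyActivity_merged.csv"))
writeLines(c("Id,ActivityDay,Calories", "1000000001,4/12/2016,1985"),
           file.path(demo_dir, "dailyCalories_merged.csv"))

# "\\.csv$" is a regular expression that matches names ending in ".csv"
csv_paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into one list; Id is read as character instead of numeric
tables <- lapply(csv_paths, read.csv, colClasses = c(Id = "character"))

# Name each list element after its file, without the extension
names(tables) <- sub("\\.csv$", "", basename(csv_paths))

names(tables)
str(tables$dailyActivity_merged)
```

With a list, later steps can loop over `tables` instead of relying on object names that end in `.csv`.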

dailyActivity_merged.csv preview

The first exploration will be on the dailyActivity_merged file. We want to explore the structure of the data. As we can see, this table has 15 variables and a total of 940 observations, where each observation corresponds to one day for a specific user (Id). All the columns are numerical except ActivityDate, which is of type chr.

str(dailyActivity_merged.csv)
'data.frame':   940 obs. of  15 variables:
 $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
 $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
 $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
 $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
 $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
 $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
 $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
 $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
 $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
 $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
 $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
 $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
 $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
 $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

dailyCalories_merged.csv preview

This table has 3 variables and a total of 940 observations, where each observation corresponds to one day for a specific user (Id) and the total calories burned that day. So now we know that every table with 940 observations is a daily summary per user, and that they are all already combined in “dailyActivity_merged.csv”.

str(dailyCalories_merged.csv)
'data.frame':   940 obs. of  3 variables:
 $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
 $ Calories   : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

heartrate_seconds_merged.csv preview

This table has a large number of observations (2,483,658). The first thing we observe is that the first Id differs from the one that opens the other tables, so we group by Id to count the distinct Ids and compare them with our previous data. We found that some Ids present in the other tables are missing here: we have only 14 versus the 33 in the other tables. This may be because the missing individuals did not have this function active on their devices, for example due to configuration issues.

str(heartrate_seconds_merged.csv)
'data.frame':   2483658 obs. of  3 variables:
 $ Id   : num  2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
 $ Time : chr  "4/12/2016 7:21:00 AM" "4/12/2016 7:21:05 AM" "4/12/2016 7:21:10 AM" "4/12/2016 7:21:20 AM" ...
 $ Value: int  97 102 105 103 101 95 91 93 94 93 ...
heartrate_seconds_merged.csv %>%
  group_by(Id) %>%
  summarise(count = n())%>%
  nrow()
[1] 14
dailyActivity_merged.csv %>%
  group_by(Id) %>%
  summarise(count = n())%>%
  nrow()
[1] 33
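
To see which users are missing rather than only how many, the two sets of Ids can be compared directly. The sketch below uses two small made-up data frames (`daily_demo` and `heart_demo` are hypothetical stand-ins for the real tables), so the Ids shown are not from the dataset.

```r
library(dplyr)

# Made-up stand-ins for the daily activity and heart rate tables
daily_demo <- data.frame(Id = c(1, 1, 2, 3), TotalSteps = c(100, 200, 300, 400))
heart_demo <- data.frame(Id = c(2, 2, 2), Value = c(97, 102, 105))

# Number of distinct users in each table
n_distinct(daily_demo$Id)
n_distinct(heart_demo$Id)

# Ids present in the daily table but absent from the heart rate table
missing_ids <- setdiff(unique(daily_demo$Id), unique(heart_demo$Id))
missing_ids
```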

hourlyCalories_merged.csv preview

Now we want to understand the data in the files with 22,099 observations. This file contains the calories of each user in 1-hour intervals for each day, so the sum of the hourly values for a day should be equal to the value in the daily file. We are going to validate this for the calories and, as we can see, there are some inconsistencies between the total calories in dailyActivity_merged.csv and hourlyCalories_merged.csv.

str(hourlyCalories_merged.csv)
'data.frame':   22099 obs. of  3 variables:
 $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ ActivityHour: chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
 $ Calories    : int  81 61 59 47 48 48 48 47 68 141 ...
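# Total calories per user according to the hourly file
verification_hourlyCalories <- hourlyCalories_merged.csv %>%
  group_by(Id) %>%
  summarise(sum = sum(Calories))

# Total calories per user according to the daily file
verification_dailyActivity <- dailyActivity_merged.csv %>%
  group_by(Id) %>%
  summarise(sum = sum(Calories))

# Match the two summaries by Id instead of by row position before subtracting
verification_dailyActivity %>%
  inner_join(verification_hourlyCalories, by = "Id", suffix = c("_daily", "_hourly")) %>%
  mutate(difference = sum_daily - sum_hourly) %>%
  head()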
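
The check above compares totals per user. A finer check, sketched here on made-up data, sums the hourly calories per user and calendar day and joins them to the daily values, so that a mismatch can be traced to a specific day. The data frames `daily_demo` and `hourly_demo` and all their values are hypothetical; only the column names follow the real files.

```r
library(dplyr)

# Made-up daily and hourly records for two users
daily_demo <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  Calories = c(150, 200, 90)
)
hourly_demo <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM",
                   "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
                   "4/12/2016 1:00:00 AM"),
  Calories = c(80, 70, 190, 50, 40)
)

# Sum the hourly calories per user and calendar day
hourly_by_day <- hourly_demo %>%
  mutate(date = as.Date(ActivityHour, format = "%m/%d/%Y")) %>%
  group_by(Id, date) %>%
  summarise(hourly_sum = sum(Calories), .groups = "drop")

# Join on user and day, then subtract: 0 means the two files agree for that day
comparison <- daily_demo %>%
  mutate(date = as.Date(ActivityDate, format = "%m/%d/%Y")) %>%
  inner_join(hourly_by_day, by = c("Id", "date")) %>%
  mutate(difference = Calories - hourly_sum)

comparison
```

Days with a non-zero `difference` are the ones worth inspecting; days missing from one file drop out of an inner join, so a `full_join()` would be the variant to use if those should be listed too.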

minuteCaloriesNarrow_merged.csv preview

We want to find out where the inconsistency starts, so we explore the minute-level file and compare it with the hourly file. We see that there are inconsistencies here as well.

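# Total calories per user according to the minute-level file
verification_minuteCaloriesNarrow <- minuteCaloriesNarrow_merged.csv %>%
  group_by(Id) %>%
  summarise(sum = sum(Calories))

# Match by Id, then compare the minute-level totals with the hourly totals
verification_hourlyCalories %>%
  inner_join(verification_minuteCaloriesNarrow, by = "Id", suffix = c("_hourly", "_minute")) %>%
  mutate(difference = sum_minute - sum_hourly) %>%
  head()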

sleepDay_merged.csv and weightLogInfo_merged.csv preview

Finally we are going to explore the last two tables, sleepDay_merged.csv and weightLogInfo_merged.csv. We can observe that the sleepDay table has information for each day and each user Id; however, it has far fewer rows than the dailyActivity_merged.csv file. We investigate this discrepancy first through the number of users, and we see that some users do not appear here.

str(sleepDay_merged.csv)
'data.frame':   413 obs. of  5 variables:
 $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
 $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
 $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
 $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...
sleepDay_merged.csv %>%
  group_by(Id)%>%
  arrange(Id)%>%
  summarise(count = n())%>%
  nrow()
[1] 24

For weightLogInfo_merged.csv, we observe that only 8 users appear in the report.

str(weightLogInfo_merged.csv)
'data.frame':   67 obs. of  8 variables:
 $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
 $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
 $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
 $ WeightPounds  : num  116 116 294 125 126 ...
 $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
 $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
 $ IsManualReport: chr  "True" "True" "False" "True" ...
 $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
unique(weightLogInfo_merged.csv[c("Id")])%>% 
  nrow()
[1] 8

No key information was found about the participants' demographics (such as age or gender) or about external factors such as weather indicators. Unfortunately, this, together with the small sample size, limits the scope of the analysis that can be performed.

Cleaning and Formatting Data-sets

We are going to combine all the data from dailyActivity_merged.csv, sleepDay_merged.csv and weightLogInfo_merged.csv in a single data-set. First we are going to clean and format the data-sets. We make all the variable names lowercase with the function clean_names() from the janitor library. We also rename the date variable in every table to date, and finally change the format of the dates to year-month-day.

dailyActivity_clean <- dailyActivity_merged.csv %>%
 clean_names() %>%
 rename(date = activity_date)%>%
 mutate(date = as.Date(date, format = "%m/%d/%Y"))

sleepDay_clean <- sleepDay_merged.csv %>%
 clean_names() %>%
 rename(date = sleep_day)%>%
 mutate(date = as.Date(date, format = "%m/%d/%Y"))


weightLogInfo_clean <- weightLogInfo_merged.csv %>%
 rename(date = Date)%>%
 clean_names() %>%
 mutate(date = as.Date(date, format = "%m/%d/%Y"))

Now we are going to prepare the data for a merge/join between tables, so we need to check the data for duplicates and null values.

sum(duplicated(dailyActivity_clean))
[1] 0
sum(is.na (dailyActivity_clean))
[1] 0
sum(duplicated(sleepDay_clean))
[1] 3
sum(is.na (sleepDay_clean))
[1] 0
sum(duplicated(weightLogInfo_clean))
[1] 0
sum(is.na (weightLogInfo_clean))
[1] 65

So we have found that sleepDay_clean has duplicate rows and that there are null values in weightLogInfo_clean. However, the null values are all in one column (fat), so we are only going to remove the duplicates.

sleepDay_clean <- sleepDay_clean %>%
  distinct()
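
The statement that the null values sit in a single column can be checked by counting them per column instead of over the whole table. A minimal sketch on a made-up table (`weight_demo` is hypothetical and only mimics the shape of weightLogInfo_clean):

```r
# Made-up weight log in which only the fat column has missing values
weight_demo <- data.frame(
  id = c(1, 1, 2),
  weight_kg = c(52.6, 52.6, 133.5),
  fat = c(22, NA, NA)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na_per_column <- colSums(is.na(weight_demo))
na_per_column
```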

Finally we are going to merge all the data into one data frame and change the type of id from numeric to string, so that each user is treated as a category.

dailyActivity_join <- dailyActivity_clean %>%
  left_join(sleepDay_clean, by = c("id", "date")) %>%
  left_join(., weightLogInfo_clean, by = c("id", "date")) 

#now we change the data type for the id column

dailyActivity_join$id <- as.character(dailyActivity_join$id)

head(dailyActivity_join)
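
One thing worth verifying after a left join is that it did not add rows: if the right-hand table has more than one record for the same id and date, the matching left-hand row is repeated. The sketch below illustrates this on made-up data (`activity_demo` and `weight_demo` are hypothetical); whether it happens in the real weight log is something to check, not something this example shows.

```r
library(dplyr)

# Made-up activity table: one row per user and day
activity_demo <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_steps = c(1000, 2000, 3000)
)

# Made-up weight log with two entries for the same user and day
weight_demo <- data.frame(
  id = c(1, 1),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  weight_kg = c(52.6, 52.4)
)

joined_demo <- left_join(activity_demo, weight_demo, by = c("id", "date"))

# The join result has more rows than the activity table it started from
nrow(activity_demo)
nrow(joined_demo)

# Key combinations that occur more than once in the right-hand table
dup_keys <- weight_demo %>%
  count(id, date) %>%
  filter(n > 1)
dup_keys
```

If `dup_keys` is empty for both right-hand tables, the joined data frame keeps exactly one row per user and day.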

We are also going to use the data in hourlyCalories_merged.csv, hourlyIntensities_merged.csv and hourlySteps_merged.csv. Here we only check for duplicates.

sum(duplicated(hourlyCalories_merged.csv))
[1] 0
sum(duplicated(hourlyIntensities_merged.csv))
[1] 0
sum(duplicated(hourlySteps_merged.csv))
[1] 0

Now we are going to format the hours and also clean the names.

# Each table parses its own timestamp column, not the one from hourlyCalories_merged.csv
hourlyCalories_clean <- hourlyCalories_merged.csv %>%
 clean_names() %>%
 rename(date_time = activity_hour)%>%
 mutate(date_time = mdy_hms(date_time))

hourlyIntensities_clean <- hourlyIntensities_merged.csv %>%
 clean_names() %>%
 rename(date_time = activity_hour)%>%
 mutate(date_time = mdy_hms(date_time))


hourlySteps_clean <- hourlySteps_merged.csv %>%
 rename(date_time = ActivityHour)%>%
 clean_names() %>%
 mutate(date_time = mdy_hms(date_time))

Since we did not find any duplicates, we are going to merge all the data into a single data frame, hourlyActivity_join.

hourlyActivity_join <- hourlyCalories_clean %>%
  inner_join(hourlyIntensities_clean, by = c("id", "date_time"))%>%
  inner_join(.,hourlySteps_clean, by = c("id", "date_time"))

# We also separate the date from the hour to make the data easier to handle
hourlyActivity_join <- hourlyActivity_join %>%
  separate(date_time, into = c("date", "time"), sep= " ")%>%
  # and we change the format of the hour to show only hour and minute
  mutate(time = format(parse_date_time(as.character(time), "HMS"), format = "%H:%M"))

# Now we change the data type of the id column
hourlyActivity_join$id <- as.character(hourlyActivity_join$id)

head(hourlyActivity_join)
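
Splitting a date-time with separate() goes through its text representation. Depending on the R version, a time of exactly midnight may be dropped when a date-time is converted to text, which would leave nothing after the space for those rows; this is an assumption worth checking in your own setup. A sketch of an alternative that derives both columns directly from the parsed date-time, shown on a made-up two-row table (`hourly_demo` is hypothetical):

```r
library(dplyr)
library(lubridate)

# Made-up hourly records, including one at exactly midnight
hourly_demo <- data.frame(
  id = c(1, 1),
  date_time = mdy_hms(c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM"))
)

split_demo <- hourly_demo %>%
  mutate(date = as_date(date_time),           # calendar date, kept as a Date
         time = format(date_time, "%H:%M"))   # hour and minute as text

split_demo
```

A side effect of this variant is that `date` stays a Date instead of becoming a character column.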

Since we have already merged our main data frames, we can drop all the other objects from the R environment (for performance and cleanliness).

#First we list all the objects in the environment to see what we have
ls()
 [1] "avg_income_year"                    "dailyActivity_clean"                "dailyActivity_join"                 "dailyActivity_merged.csv"          
 [5] "dailyCalories_merged.csv"           "dailyIntensities_merged.csv"        "dailySteps_merged.csv"              "filename"                          
 [9] "filterdata"                         "gss"                                "heartrate_seconds_merged.csv"       "hourlyActivity_join"               
[13] "hourlyCalories_clean"               "hourlyCalories_merged.csv"          "hourlyIntensities_clean"            "hourlyIntensities_merged.csv"      
[17] "hourlySteps_clean"                  "hourlySteps_merged.csv"             "i"                                  "minuteCaloriesNarrow_merged.csv"   
[21] "minuteCaloriesWide_merged.csv"      "minuteIntensitiesNarrow_merged.csv" "minuteIntensitiesWide_merged.csv"   "minuteMETsNarrow_merged.csv"       
[25] "minuteSleep_merged.csv"             "minuteStepsNarrow_merged.csv"       "minuteStepsWide_merged.csv"         "sleepDay_clean"                    
[29] "sleepDay_merged.csv"                "verification_dailyActivity"         "verification_hourlyCalories"        "verification_minuteCaloriesNarrow" 
[33] "weightLogInfo_clean"                "weightLogInfo_merged.csv"          
#Now we drop all data frames except the ones we created and will use later.
rm(list=setdiff(ls(), c("dailyActivity_join", 'hourlyActivity_join', 'dailyActivity_clean', 'sleepDay_clean', 'weightLogInfo_clean', 'heartrate_seconds_merged.csv')))

Finally, we are not going to use all the columns in dailyActivity_join, so we can drop some of them (for performance and cleanliness).

dailyActivity_join <- dailyActivity_join %>% 
  select(-c(total_distance,
            tracker_distance,
            logged_activities_distance, 
            very_active_distance, 
            moderately_active_distance, 
            light_active_distance,
            sedentary_active_distance, 
            total_sleep_records,  
            total_time_in_bed,  
            weight_kg,
            weight_pounds,
            fat,
            bmi,
            is_manual_report,
            log_id))

Normality Analysis of the Data Frames

Here we are going to investigate the normality of the numerical data, to learn more about the limitations of our data. Let's start with the variables in dailyActivity_join:

#Here we use the DataExplorer library, since our data frame has some categorical variables and looping over ggplot2 would be cumbersome.
dailyActivity_join %>%
  plot_histogram( 
    ncol = 3,
    ggtheme = theme_light()
    )

We can see that some variables are close to normal, with little skew or few abnormal values (e.g. calories, total_minutes_asleep and lightly_active_minutes), while others have strongly right-skewed distributions (e.g. fairly_active_minutes and very_active_minutes).

Now we analyze the data in hourlyActivity_join:

hourlyActivity_join %>%
  plot_histogram( 
    ncol = 3,
    ggtheme = theme_light()
    )

Here we can see that all variables are right skewed. This is related to the fact that during most hours people are working or sleeping; since the intensity is low in those hours, a skewed distribution for the calories is to be expected.
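
The visual impression from the histograms can be complemented with a formal test such as Shapiro-Wilk, available in base R as shapiro.test(). The sketch below runs it on two simulated samples, not on the tracker data: `normal_like` and `right_skewed` are made-up vectors chosen only to show what the test reports for a symmetric and a skewed shape. Note that the test assumes independent observations, which repeated measurements of the same users do not strictly satisfy, so on the real data its result should be read with caution.

```r
set.seed(42)

# Simulated samples: one roughly normal, one clearly right skewed
normal_like  <- rnorm(200, mean = 2000, sd = 300)
right_skewed <- rexp(200, rate = 1 / 20)

# Shapiro-Wilk test: a small p-value is evidence against normality
shapiro_normal <- shapiro.test(normal_like)
shapiro_skewed <- shapiro.test(right_skewed)

shapiro_normal$p.value
shapiro_skewed$p.value
```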

Data Analysis

Distribution of the tracking of the devices

We are ready to ask some questions of our data. The first question we want to investigate is:

  • How is the usage of the devices distributed across the different tracking activities?

We already know the answer to this question thanks to the initial exploration. We have 33 users who use their device to track their daily activity, 24 users who track their sleep, 8 users who track their weight and 14 users who track their heart rate. So let's put this information in a plot.

# We are going to plot a Venn diagram of the 4 data frames dailyActivity_clean, sleepDay_clean, weightLogInfo_clean and heartrate_seconds_merged.csv
#First we need to create the sets: for each data frame, a set of unique Ids.

step_ids <- unique(dailyActivity_clean$id, incomparables = FALSE)
sleep_ids <- unique(sleepDay_clean$id, incomparables = FALSE)
heartrate_ids <- unique(heartrate_seconds_merged.csv$Id, incomparables = FALSE)
weight_ids <- unique(weightLogInfo_clean$id, incomparables = FALSE)

#Now we create the graph. First we need a named list of the sets.
x <- list(A=step_ids, B=sleep_ids, C=heartrate_ids, D=weight_ids)

#Function to display the Venn diagram inside markdown; this requires the VennDiagram library
display_venn <- function(x, ...){
  grid.newpage()
  venn_object <- venn.diagram(x, filename = NULL, ...)
  grid.draw(venn_object)
}

#display Venn diagram
display_venn(
  x,
  category.names = c("Steps count", "Sleep monitor", "Heart monitor", "Weight tracking"),
  fill = c("#999999", "#E69F00", "#56B4E9", "#009E73")
  )
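
The overlaps drawn in the Venn diagram can also be computed as numbers with base R set operations. The sketch below uses four small made-up Id vectors (the `_demo` names and values are hypothetical, not the real Ids) and finds the users who appear in all four sets.

```r
# Made-up Id sets standing in for the four trackers
sets_demo <- list(
  steps     = c(1, 2, 3, 4, 5),
  sleep     = c(1, 2, 3),
  heartrate = c(2, 3, 5),
  weight    = c(3, 5)
)

# Size of each set
lengths(sets_demo)

# Users present in every set: intersect the sets one after another
all_four <- Reduce(intersect, sets_demo)
all_four

# Users who track steps and sleep but not heart rate
steps_sleep_only <- setdiff(intersect(sets_demo$steps, sets_demo$sleep),
                            sets_demo$heartrate)
steps_sleep_only
```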

Type of users per activity level

Here we will ascertain how often the participants use their smart devices. With dailyActivity_join, we will assume that days with 200 or fewer total steps are days on which users did not wear their devices. We will filter out these inactive days and assign the following designations:

  • Low Use - 1 to 5 days
  • Moderate Use - 6 to 20 days
  • High Use - 21 to 31 days

Breaking down the analysis in this way will help us understand the different trends underlying each usage group.

#Here we create a table to classify the users according to the number of days they appear in the data frame
dailyActivity_join %>%
  filter(total_steps > 200) %>%
  group_by(id) %>%
  summarize(count = n()) %>%
  mutate(usage =  ifelse(count <= 5,  "Low use", 
                        ifelse(count <= 20,  "Moderate use", 
                        ifelse(count <= 31,  "High Use", NA))))%>%

#We now create a percentage data frame to better visualize the results in the graph.
#The :: calls the function percent from the scales package; since we use it only once, we don't need to load the library.
  group_by(usage) %>%
  summarise(total = n()) %>%
  mutate(perc = total/sum(total))%>%
  mutate(perc = scales::percent(perc)) %>% 

#Now that we have our new table we can create our plot.
  ggplot(aes(x = "", y = total, fill = usage )) +
    geom_bar(stat='identity', width = 1) +
    coord_polar("y", start=0)+
    theme_void()+
    theme(plot.title = element_text(hjust = 0.5, vjust= -5, size = 20, face = "bold")) +
    geom_text(aes(label = perc, x = 1.25),position = position_stack(vjust = 0.5)) +
    labs(title = "Usage Group Distribution") +
    guides(fill = guide_legend(title = "Usage Type"))

Analyzing our results, we can see that 63.6% of the users in our sample use their device frequently, almost every day (between 25 and 31 days); 27.3% use their device 15 to 25 days; 6.1% use their device between 5 and 15 days; and 3.0% use their devices very rarely.
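
The thresholds used in the classification code (5, 20 and 31 days) can also be expressed with base R's cut(), which makes the interval boundaries explicit. A minimal sketch on a made-up vector of day counts (`days_used` is hypothetical):

```r
# Made-up number of active days for six users
days_used <- c(3, 5, 6, 20, 21, 31)

# Intervals are (0, 5], (5, 20] and (20, 31]: the upper bound belongs to each class
usage <- cut(days_used,
             breaks = c(0, 5, 20, 31),
             labels = c("Low use", "Moderate use", "High use"))

data.frame(days_used, usage)
table(usage)
```

Because cut() returns a factor whose levels follow the order of `labels`, the groups also appear in that order in plots without a separate releveling step.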

Time used smart device and distribution

We will analyse the steps taken by users within and between groups, per day and per hour. Let's start with the daily steps of each user in each group.

#Here we add the usage classification from before as a new column, since we need it for the rest of the analysis.
dailyActivity_join <- dailyActivity_join %>%
  filter(total_steps > 200) %>%
  group_by(id) %>%
  mutate(count = n()) %>%
  mutate(usage =  ifelse(count <= 5,  "Low use",
                        ifelse(count <= 20,  "Moderate use",
                        ifelse(count <= 31,  "High use", NA)))) %>%

#We order the levels the way we want them to appear in the plots.
 mutate(usage = factor(usage, levels = c('Low use','Moderate use','High use'))) %>% 
  
#The steps above ran per id, so we ungroup; otherwise every later summarize would still be grouped by id.
 ungroup()
dailyActivity_join %>%
  ggplot(aes(x = date, y = total_steps, group = id, color = id)) +
    geom_line() +
    theme(legend.position = "none")+
    facet_wrap(~usage, ncol = 1)

There is no specific trend here, since some high use users have days with low total steps. Now we are going to plot the average steps per day for each group.

dailyActivity_join %>%
  group_by(usage, date) %>% 
  summarize(average_steps = mean(total_steps)) %>%
  ggplot(aes(x = date, y = average_steps, fill = usage,  color = usage)) + 
     geom_col()+
     facet_wrap(~usage)
`summarise()` has grouped output by 'usage'. You can override using the `.groups` argument.

Now we are going to visualize this more clearly with a boxplot.

dailyActivity_join %>%
  group_by(usage, date) %>% 
  summarize(average_steps = mean(total_steps)) %>%
  ggplot(aes(x = usage, y = average_steps, fill = usage,  color = usage)) + 
     geom_boxplot()
`summarise()` has grouped output by 'usage'. You can override using the `.groups` argument.

Finally, we are going to plot the average steps per weekday for each group.

#First we create a column that contains each weekday
dailyActivity_join %>%
  mutate(weekday = weekdays(as.Date(date)), 
         weekday = fct_relevel(weekday, c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))) %>% 

#Now we group by usage and weekday, compute the mean and the half-width of the 95% confidence interval (n - 1 degrees of freedom), and plot.
  group_by(weekday, usage) %>% 
  summarize(average_steps = mean(total_steps), ci = qt(0.975, df = n() - 1)*sd(total_steps)/sqrt(n()))%>%
  ggplot(aes(x = weekday, y = average_steps, fill = usage,  color = usage)) +
     geom_col()+
#Uncomment the next line to add the confidence intervals
     #geom_errorbar(aes(ymin =  average_steps - ci, ymax =  average_steps + ci), width = 0.2, colour = 'black') +
     facet_wrap(~usage, ncol=1)
`summarise()` has grouped output by 'weekday'. You can override using the `.groups` argument.
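
For reference, a 95% confidence interval for a mean based on the t distribution uses n - 1 degrees of freedom: mean ± qt(0.975, n - 1) · sd / sqrt(n). The sketch below works this out step by step on a made-up vector of daily steps (`steps` is hypothetical) and compares the result with the interval reported by base R's t.test().

```r
# Made-up daily step counts for one group on one weekday
steps <- c(8000, 9500, 10200, 7600, 11000)

n          <- length(steps)
mean_steps <- mean(steps)
std_error  <- sd(steps) / sqrt(n)

# Critical value of the t distribution with n - 1 degrees of freedom
t_crit        <- qt(0.975, df = n - 1)
ci_half_width <- t_crit * std_error

manual_ci <- c(lower = mean_steps - ci_half_width,
               upper = mean_steps + ci_half_width)
manual_ci

# The same interval as computed by t.test()
t.test(steps)$conf.int
```

As noted in the findings, the daily observations of one user are not independent, so such an interval is only a rough guide for this dataset.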

We can see some patterns in our data:

  • Average steps per day increase as device usage increases; we are going to investigate this further in the next section.

  • For moderate and high use users, there is no clear day that shows a higher mean than the other days (a t-test would be needed to confirm this; note, however, that the data are not independent within or between groups).

  • Low use users (1 individual) do not seem to show any difference in mean compared with the moderate use users.

Usage during the day (a more in-depth analysis)

Now that we have some usage trends, we want to see how device usage is distributed during the day and how it correlates with some activities. For this we are going to work with the hourlyActivity_join table. The first thing we investigate is the distribution of device usage during each day of the week for each group.

#This is a different data frame, so we classify the users again. First we sum the steps per user and calendar date, to filter on values > 200 in the next step.
#Grouping by the full date (not the day of the month) keeps 12 April and 12 May apart.
hourlyActivity_join <- hourlyActivity_join %>%
  group_by(id, date) %>%
  mutate(total_steps = sum(step_total)) %>% 
  ungroup()
#Now we count the days on which each user used the device and assign the category.
hourlyActivity_join <- hourlyActivity_join %>%
  filter(total_steps > 200) %>%
  group_by(id) %>% 
  mutate(days_usage = n_distinct(date)) %>% 
  mutate(usage =  ifelse(days_usage <= 5,  "Low use", 
                        ifelse(days_usage <= 20,  "Moderate use", 
                        ifelse(days_usage  <= 31,  "High use", NA)))) %>% 
#We order the levels the way we want them to appear in the plots.
  mutate(usage = factor(usage, levels = c('Low use','Moderate use','High use'))) %>% 
  ungroup()
#Now we plot
hourlyActivity_join  %>%
  mutate(weekday = format(ymd(date), format = '%a'), 
         weekday = fct_relevel(weekday, c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))) %>%
  group_by(time, weekday, usage) %>% 
  summarize(average_steps = mean(step_total)) %>% 
  ggplot(aes(x = time, y = average_steps, fill = average_steps)) +
    viridis::scale_fill_viridis(option = "D")+
    geom_col()+
    facet_grid(usage~weekday)+
    theme(axis.text.x = element_text(size = 5, angle = 90))
`summarise()` has grouped output by 'time', 'weekday'. You can override using the `.groups` argument.

#We also make a heat map of the same distribution to have another option for presentation.
hourlyActivity_join %>%
  mutate(weekday = format(ymd(date), format = '%a'), 
         weekday = fct_relevel(weekday, c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))) %>%
  group_by(weekday,time, usage) %>% 
  summarize(average_steps = mean(step_total)) %>% 
  ggplot(aes(x = time, y = weekday, fill = average_steps)) +
    viridis::scale_fill_viridis(option = "D")+
    geom_tile()+
    geom_text(aes(label = round(average_steps, digits = 0)), color = "black", size = 2.0) +
    facet_wrap(~usage, ncol=1)+
    theme(axis.text.x = element_text(size = 5, angle = 90))
`summarise()` has grouped output by 'weekday', 'time'. You can override using the `.groups` argument.

We can see some patterns in our data:

  • The high use users start their day an hour earlier (6:00 AM) than the other groups and end their day an hour later (10:00 PM). On weekdays the peaks fall between 5:00 and 8:00 PM, suggesting habitual exercise after work ends.

  • Moderate use users display peaks in their steps on Saturdays and Sundays, between 8:00 AM and 12:00 PM.

More specific questions about the data

As we saw in the last part, there are some hours in which the users show peaks. We want to investigate whether this is related to exercise sessions (one can go to the gym and only lift weights, or spend some time doing cardio on a treadmill). For this we are going to investigate the intensity variable, and we want to answer the following question:

  • What is the relation between intensity and average steps?
#First we are going to plot the distribution of intensity between the days.
hourlyActivity_join %>%
  mutate(weekday = format(ymd(date), format = '%a'), 
         weekday = fct_relevel(weekday, c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))) %>%
  group_by(time, weekday, usage) %>% 
  summarize(average_intensity_hour = mean(average_intensity))%>% 
  ggplot(aes(x = time, y = average_intensity_hour, fill = average_intensity_hour)) +
    viridis::scale_fill_viridis(option = "inferno")+
    geom_col()+
    facet_grid(usage~weekday)+
    theme(axis.text.x = element_text(size = 5, angle = 90))
`summarise()` has grouped output by 'time', 'weekday'. You can override using the `.groups` argument.

We can see that the plots of average_intensity and average_steps are very similar for each group, so we expect a linear correlation between these two variables.

hourlyActivity_join %>%
  mutate(weekday = weekdays(as.Date(date))) %>%
  group_by(time, weekday, usage) %>% 
  summarize(average_intensity_hour = mean(average_intensity), average_steps = mean(step_total) )%>% 
  ggplot(aes(x = average_intensity_hour, y = average_steps)) +
    geom_point()+
    geom_smooth()+
    facet_wrap(~usage)
`summarise()` has grouped output by 'time', 'weekday'. You can override using the `.groups` argument.

As expected, we have a positive (almost linear) correlation between the average intensity per hour and the average steps per hour. So we can associate the high step counts with exercise sessions in which the user is very active. Let's also investigate the correlation between these variables and the average calories burned.
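
The strength of a relation like this can be quantified with the Pearson correlation coefficient, available in base R as cor(). The sketch below uses made-up hourly averages (`intensity` and `steps` are hypothetical vectors, not values from the dataset) only to show the calls involved.

```r
# Made-up hourly averages of intensity and steps
intensity <- c(0.1, 0.3, 0.5, 0.9, 1.4)
steps     <- c(120, 310, 480, 900, 1350)

# Pearson correlation coefficient: close to 1 means a strong positive linear relation
r <- cor(intensity, steps, method = "pearson")
r

# cor.test() adds a significance test and a confidence interval for r
cor.test(intensity, steps)$p.value
```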

hourlyActivity_join %>%
  mutate(weekday = weekdays(as.Date(date))) %>%
  group_by(time, weekday, usage) %>% 
  summarize(average_intensity_hour = mean(average_intensity), average_steps = mean(step_total), average_calories = mean(calories)) %>% 
  GGally::ggpairs(columns = c(4,5,6))
`summarise()` has grouped output by 'time', 'weekday'. You can override using the `.groups` argument.

We can see here a strong correlation between these variables. This is expected since, as we saw before, high step counts are generally associated with high-intensity exercise sessions, in which the user tends to burn more calories.

What about sleep behavior?

Another variable that is interesting to analyze is sleep behavior. We want to investigate how users sleep according to their activity level.

  • What is the relation between activity level and sleep hours?
#First we are going to plot the distribution of sleep between the groups.
dailyActivity_join %>%
  group_by(date, usage) %>% 
  summarize(average_sleep_minutes = mean(total_minutes_asleep, na.rm=TRUE)) %>% 
  ggplot(aes(x = usage, y = average_sleep_minutes, fill = usage)) +
    geom_boxplot()
`summarise()` has grouped output by 'date'. You can override using the `.groups` argument.

Here we see that the users have almost the same mean regardless of their usage group. The missing value for the low use group arises because the only user in this group has no sleep data. We are going to go into more depth by classifying the time slept:

  • Bad sleep - slept 300 minutes or less
  • Normal sleep - slept more than 300 and up to 480 minutes
  • Over sleep - slept more than 480 minutes
dailyActivity_join <- dailyActivity_join %>%
  mutate(sleep_type =  ifelse(total_minutes_asleep<= 300,  "Bad sleep", 
                       ifelse(total_minutes_asleep <= 480,  "Normal sleep", 
                       ifelse(total_minutes_asleep > 480,  "Over sleep", NA))), 
         sleep_type = factor(sleep_type, level = c('Bad sleep','Normal sleep','Over sleep'))) 
dailyActivity_join %>%
  group_by(sleep_type, id) %>%
  summarize(count_sleep = n()) %>%
  drop_na() %>%
  summarize(total_sleep_type = n())  %>% 
  mutate(perc = total_sleep_type/sum(total_sleep_type))%>%
  mutate(perc = scales::percent(perc)) %>% 
  ggplot(aes(x = "", y = total_sleep_type, fill = sleep_type)) +
    geom_bar(stat='identity', width = 1) +
    coord_polar("y", start=0)+
    theme_void()+
    theme(plot.title = element_text(hjust = 0.5, vjust= -5, size = 20, face = "bold")) +
    geom_text(aes(label = perc, x = 1.2),position = position_stack(vjust = 0.5)) +
    labs(title = "Sleep Type Distribution") +
    guides(fill = guide_legend(title = "Sleep Type"))
`summarise()` has grouped output by 'sleep_type'. You can override using the `.groups` argument.

#We can also visualize this distribution across the different usage groups.
dailyActivity_join %>%
  group_by(usage, sleep_type, id) %>%
  summarize(count_sleep = n()) %>%
  drop_na() %>%
  summarize(total_sleep_type = n())  %>% 
  mutate(perc = total_sleep_type/sum(total_sleep_type))%>%
  mutate(perc = scales::percent(perc)) %>%
  ggplot(aes(x = "", y = total_sleep_type, fill = sleep_type)) +
    geom_bar(stat='identity', width = 1, position = "fill") +
    coord_polar("y", start=0)+
    theme_void()+
    theme(plot.title = element_text(hjust = 0.5, vjust= 5, size = 20, face = "bold")) +
    geom_text(aes(label = perc, x=1.2), position = position_fill(vjust = 0.5)) +
    labs(title = "Sleep Type Distribution") +
    guides(fill = guide_legend(title = "Sleep Type"))+
    facet_wrap(~usage, strip.position = "bottom")
`summarise()` has grouped output by 'usage', 'sleep_type'. You can override using the `.groups` argument.
`summarise()` has grouped output by 'usage'. You can override using the `.groups` argument.

Analyzing our results, we can see that bad sleep accounts for 26.4% of the cases, normal sleep for 35.8% and over sleep for 35.8%. Because each user is counted once for every sleep type they recorded on at least one day, a user can appear in more than one category. Across the usage groups, the distribution is similar to the overall one.

Finally, we are going to relate sleep behavior to activity level by classifying each daily record according to its activity minutes. The classification compares the minutes in each activity category against the average across all users: for example, a record with more sedentary minutes than the global sedentary average (and fewer minutes than average in the other categories) is classified as sedentary. Records that do not fall into any category are excluded from the data.

#We need to classify the activity level. First we get the averages across all users.
#Zero values are turned into NA so that they are ignored in the mean calculation.
temp <- dailyActivity_join %>%
  mutate(across(c(sedentary_minutes, lightly_active_minutes,
                  fairly_active_minutes, very_active_minutes), ~ na_if(.x, 0))) %>% 
  mutate(sedentary_minutes_avg = mean(sedentary_minutes, na.rm = TRUE), 
         lightly_active_minutes_avg = mean(lightly_active_minutes, na.rm = TRUE),
         fairly_active_minutes_avg = mean(fairly_active_minutes, na.rm = TRUE),
         very_active_minutes_avg = mean(very_active_minutes, na.rm = TRUE)) %>% 

#We replace the NA values with 0 again to avoid errors in the categorization. Then we classify with case_when.
  mutate(sedentary_minutes = replace(sedentary_minutes, is.na(sedentary_minutes), 0),
         lightly_active_minutes = replace(lightly_active_minutes, is.na(lightly_active_minutes), 0),
         fairly_active_minutes = replace(fairly_active_minutes, is.na(fairly_active_minutes), 0),
         very_active_minutes = replace(very_active_minutes, is.na(very_active_minutes), 0)) %>% 
  mutate(active_type = factor(case_when(sedentary_minutes > sedentary_minutes_avg &
                                 lightly_active_minutes <  lightly_active_minutes_avg &
                                 fairly_active_minutes< fairly_active_minutes_avg &
                                 very_active_minutes < very_active_minutes_avg ~ "Sedentary",
                                 lightly_active_minutes >  lightly_active_minutes_avg &
                                 fairly_active_minutes < fairly_active_minutes_avg &
                                 very_active_minutes < very_active_minutes_avg ~ "Lightly Active",
                                 fairly_active_minutes > fairly_active_minutes_avg &
                                 very_active_minutes < very_active_minutes_avg ~ 'Fairly Active',
                                 very_active_minutes > very_active_minutes_avg ~ 'Very Active'), 
                      levels=c("Sedentary", "Lightly Active", "Fairly Active", "Very Active")))%>%
  drop_na(sleep_type, active_type) 
  
#finally we plot.

temp %>% 
    ggplot(aes(x = active_type, fill = sleep_type)) +
    geom_bar(position = "fill") +
    labs(y = "Proportion")


temp %>% 
    ggplot(aes(x = active_type, fill = sleep_type)) +
    geom_bar(position = "fill") +
    labs(y = "Proportion")+
    facet_wrap(~usage) 

Analyzing our results, we can see that sedentary people tend to sleep badly. We can also observe that a little activity during the day tends to go with normal sleep, and that oversleeping decreases as the activity level increases.

Discussion

The FitBit data set confirms that not all users fully utilize the functions of their devices/trackers. All 33 unique IDs used the step count function. 24/33 unique IDs used the sleep tracking function. 14/33 unique IDs used the heart-rate tracking. 8/33 unique IDs used their devices to track their weight.

High Use Group

This group consists of 24 users, or 73% of the total sample, who wear the device regularly, between 21 and 31 days. This is the most active group and also the most varied in the types of exercise carried out, ranging from light to fairly and very active forms of exercise. They tend to be active throughout the week, with an average of 9054.5 steps. Their sleep behavior is fairly evenly distributed between bad, normal and over sleep.

Moderate Use Group

This group consists of 8 users, or 24% of the total sample, who wear the device between 6 and 20 days. Users in this group are less active and walk fewer steps than the ‘High Use’ group on weekdays, but they are active during the weekends, between 8:00 AM and 1:00 PM. While significantly less active than the ‘High Use’ group, they also stick to their routine, and their sleep behavior is likewise fairly evenly distributed between bad, normal and over sleep.

Low Use Group

This group consists of only 1 user, or 3% of the sample, which is too small to provide any meaningful analysis. Most of the trends are skewed away from any recognizable pattern. With this in mind, this group displays behavior similar to the moderate use group. No sleep behavior was recorded for this user.

Final Remarks

Bellabeat’s mission is to empower women by providing them with the data to discover themselves.

To respond to our business task and help Bellabeat with its mission, based on our results, I would advise using Bellabeat's own tracking data for further analysis. The datasets used have a small sample and may be biased, since we did not have any demographic details about the users. Knowing that our main target is young and adult women, I would encourage continuing to look for trends in order to create a marketing strategy focused on them.

That being said, our analysis found different trends that may help our online campaign and improve the Bellabeat app:

1. Daily notification for exercise: We classified users into 3 categories and saw that, on average, users sleep less than 8 hours a day. However, we saw better sleep habits when users increased their activity level. We can encourage customers to reach at least the daily recommended steps by sending them alerts if they have not reached that goal, and by publishing posts in our app explaining the benefits of reaching it.

2. Notification and sleep techniques: To reduce bad sleep and over sleep, users could set a desired bedtime and receive a notification a few minutes beforehand to prepare for sleep. We can also offer helpful resources to help customers sleep, e.g. breathing advice, podcasts with relaxing music and sleep techniques.

3. Technical support: Based on the distributions, we found that many users do not use all the functionality of their devices. Bellabeat can offer helpful resources and reminders to help customers configure their devices and get all the benefits from their purchase.

#Export the data as a CSV file (which Excel can open) for some manual verification.
write.csv(x = dailyActivity_join, file = "my_file.csv", row.names = FALSE)
---
title: "Bellabeat Case Study: Google Data Analytics"
output: 
  html_notebook:
    toc: true
    theme: united
---

# Summary

[Bellabeat](https://bellabeat.com/) is a high-tech company that manufactures health-focused smart products. They offer different smart devices that collect data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits.

The main focus of this case was to analyze smart device fitness data and determine how it could help unlock new growth opportunities for Bellabeat. We used data from the FitBit app to gain insight into how consumers use non-Bellabeat smart devices and to improve our product **Leaf smart**. 

## The FitBit Fitness Tracker Data

[The FitBit Fitness Tracker dataset](https://www.kaggle.com/datasets/arashnic/fitbit) is a public dataset (CC0: Public Domain) made available on Kaggle by the user Mobius.

This dataset is generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016 (1 month period). Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.

The dataset is a collection of 18 .csv files: 15 in long format and 3 in wide format. It contains wide-ranging information, from activity metrics, calories, sleep records and metabolic equivalent of task (METs) to heart rate and steps, in timeframes of seconds, minutes, hours and days. No metadata was provided. 

# Preview of the Data 

We first want to preview all the data we have, so we can establish a working plan. We are going to load all the tables and explore the structure of each one. We load the *readr* package, along with some other libraries that will be used during the process, and then import the CSV files into the R workspace.

```{r}
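#We set the time locale to English so that weekdays, months, etc. are printed in English.
#Note: the locale name "English" is Windows-specific; on Linux or macOS use e.g. "en_US.UTF-8" or "C".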
Sys.setlocale("LC_TIME", "English")
```

```{r}
library(readr)
library(tidyverse)
library(janitor)
library(lubridate)
library(DataExplorer)
library(VennDiagram)
```

The following code imports all the *.csv* files from a specific directory into R.

```{r}
data_dir <- "Fitabase Data 4.12.16-5.12.16/"

#The pattern is a regular expression, so we match file names ending in ".csv"
filename <- list.files(path = data_dir, pattern = "\\.csv$")

#Each file is read into a data frame named after the file (e.g. dailyActivity_merged.csv)
for (f in filename) 
  assign(f, read.csv(paste0(data_dir, f)))

filename
```
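
As an alternative to creating one global object per file, the tables can be collected in a single named list. The following is only a minimal sketch under stated assumptions: the directory `demo_dir`, the two tiny CSV files and the list name `tables` are hypothetical stand-ins created for illustration, not part of the Fitabase export.

```r
# Hypothetical stand-in directory with two tiny CSV files, so the sketch runs anywhere.
demo_dir <- file.path(tempdir(), "fitabase_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Id = c(1, 2), Calories = c(1800, 2100)),
          file.path(demo_dir, "dailyCalories_merged.csv"), row.names = FALSE)
write.csv(data.frame(Id = c(1, 2), StepTotal = c(5000, 9000)),
          file.path(demo_dir, "dailySteps_merged.csv"), row.names = FALSE)

# "\\.csv$" is a regular expression that matches only names ending in ".csv".
csv_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into one named list instead of separate global objects.
tables <- lapply(csv_files, read.csv)
names(tables) <- tools::file_path_sans_ext(basename(csv_files))

str(tables$dailyCalories_merged)
```

With this layout each table is reached as `tables$<name>`, and the workspace holds a single object rather than one per file.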

**dailyActivity_merged.csv preview**

The first exploration will be on the **dailyActivity_merged** file. We want to explore the structure of the data. As we can see, this table has 15 variables and a total of 940 observations, where each observation corresponds to one day of a specific user (Id). The data in the file are all numerical except for the column ActivityDate, which is of type *chr*. 

```{r}
str(dailyActivity_merged.csv)
```

**dailyCalories_merged.csv preview**

This table has 3 variables and a total of 940 observations, where each observation corresponds to one day of a specific user (Id) and the total calories burned that day. So now we know that every table with 940 observations is a daily summary of the activity of a specific user, and that they are all combined in "dailyActivity_merged.csv".

```{r}
str(dailyCalories_merged.csv)
```

**heartrate_seconds_merged.csv preview**

This table has a lot of observations, as you can see (2483658). The first thing we observe is that the set of Ids differs from that of the other tables. So we group by Id to count the users and compare with our previous data. We found that some Ids are missing compared with the other tables: here we only have 14 versus the 33 in the other tables. This may be because the missing individuals do not have this function active on their device, for example due to configuration issues.

```{r}
str(heartrate_seconds_merged.csv)
```

```{r}
heartrate_seconds_merged.csv %>%
  group_by(Id) %>%
  summarise(count = n())%>%
  nrow()

dailyActivity_merged.csv %>%
  group_by(Id) %>%
  summarise(count = n())%>%
  nrow()
```

**hourlyCalories_merged.csv preview**

Now we want to understand the data inside the files with 22099 observations. These files contain the calories of a specific user in 1-hour intervals for each day. So the sum of the data here for each day should be equal to the data in the file with the daily records. We are going to validate this for the calories and, as we can see, there are some inconsistencies between the total calories in *dailyActivity_merged.csv* and *hourlyCalories_merged.csv*.

```{r}
str(hourlyCalories_merged.csv)
```
```{r}
verification_hourlyCalories <- hourlyCalories_merged.csv %>%
  group_by(Id) %>%
  arrange(Id) %>%
  summarise(sum = sum(Calories))

verification_dailyActivity <- dailyActivity_merged.csv %>%
  group_by(Id) %>%
  arrange(Id) %>%
  summarise(sum = sum(Calories))

verification_dailyActivity %>%
  mutate(difference = verification_dailyActivity$sum - verification_hourlyCalories$sum) %>% 
  head()
```
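
Subtracting two summary columns by row position only works when both summaries contain the same users in the same order. A hedged sketch of the same check done with a join on `Id` is shown below; the data frames `daily_totals` and `hourly_totals` and their numbers are made up for illustration.

```r
library(dplyr)

# Toy per-user totals (made-up numbers); user 3 is missing from the hourly table.
daily_totals  <- data.frame(Id = c(1, 2, 3), daily_sum  = c(2000, 1800, 2200))
hourly_totals <- data.frame(Id = c(1, 2),    hourly_sum = c(1990, 1800))

# Joining on Id lines users up by key rather than by row position,
# and a user missing from one table shows up as NA instead of shifting the rows.
calorie_check <- daily_totals %>%
  full_join(hourly_totals, by = "Id") %>%
  mutate(difference = daily_sum - hourly_sum)

calorie_check
```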

**minuteCaloriesNarrow_merged.csv preview**

We want to explore where the inconsistency starts, so we explore the minute-level file and compare it with the hourly file. We see that there are inconsistencies here as well.

```{r}
verification_minuteCaloriesNarrow <- minuteCaloriesNarrow_merged.csv %>%
  group_by(Id) %>%
  arrange(Id) %>%
  summarise(sum = sum(Calories))

verification_hourlyCalories %>%
  mutate(difference = verification_minuteCaloriesNarrow$sum - verification_hourlyCalories$sum) %>% 
  head()
```

**sleepDay_merged.csv and weightLogInfo_merged.csv preview** 

Finally, we are going to explore the last two tables, *sleepDay_merged.csv* and *weightLogInfo_merged.csv*. We can observe that the sleepDay table has information for each day and each user Id; however, it has far fewer rows than the *dailyActivity_merged.csv* file. We investigate this discrepancy first by the number of users, and we see that some users do not appear here.

```{r}
str(sleepDay_merged.csv)
```

```{r}
sleepDay_merged.csv %>%
  group_by(Id)%>%
  arrange(Id)%>%
  summarise(count = n())%>%
  nrow()
```

For *weightLogInfo_merged.csv*, we observe that only 8 users appear in the report.

```{r}
str(weightLogInfo_merged.csv)
```

```{r}
unique(weightLogInfo_merged.csv[c("Id")])%>% 
  nrow()
```

No information was found about participant demographics such as age or gender, or about external factors such as weather. Unfortunately, this, together with the small sample size, limits the scope of the analysis that can be performed.

## Cleaning and Formatting Data-sets

We are going to combine all the data from *dailyActivity_merged.csv*, *sleepDay_merged.csv* and *weightLogInfo_merged.csv* into a single data-set. First we clean and format the data-sets. We make all the variable names lowercase with the function *clean_names()* from the **janitor** library. We also rename every date variable to **date** in all tables, and finally change the format of the dates to year-month-day.

```{r}
dailyActivity_clean <- dailyActivity_merged.csv %>%
 clean_names() %>%
 rename(date = activity_date)%>%
 mutate(date = as.Date(date, format = "%m/%d/%Y"))

sleepDay_clean <- sleepDay_merged.csv %>%
 clean_names() %>%
 rename(date = sleep_day)%>%
 mutate(date = as.Date(date, format = "%m/%d/%Y"))


weightLogInfo_clean <- weightLogInfo_merged.csv %>%
 rename(date = Date)%>%
 clean_names() %>%
 mutate(date = as.Date(date, format = "%m/%d/%Y"))
```
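
The date conversion above relies on `as.Date()` reading only as much of each string as the format describes, so a trailing time of day is ignored. A small sketch with illustrative strings (written in the style of the raw columns, not copied from the files):

```r
# Illustrative raw values: a plain date and two date-times with AM/PM suffixes.
raw_dates <- c("4/12/2016", "4/12/2016 12:00:00 AM", "5/2/2016 11:59:59 PM")

# Only month/day/year are parsed; any trailing characters are ignored.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Date objects print in year-month-day form by default.
format(parsed)
class(parsed)
```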

Now we are going to prepare the data for a merge/join between tables, so we need to check the data for duplicates and null values.

```{r}
sum(duplicated(dailyActivity_clean))
sum(is.na (dailyActivity_clean))

sum(duplicated(sleepDay_clean))
sum(is.na (sleepDay_clean))

sum(duplicated(weightLogInfo_clean))
sum(is.na (weightLogInfo_clean))
```

We found that sleepDay_clean has duplicate rows and that there are null values in weightLogInfo_clean. However, the null values are confined to a single column (fat), so we only remove the duplicates.

```{r}
sleepDay_clean <- sleepDay_clean %>%
  distinct()
```
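
To illustrate what the duplicate check and `distinct()` do, here is a minimal sketch on a made-up table (`sleep_demo` is hypothetical and only mimics the shape of the sleep data):

```r
library(dplyr)

# Made-up sleep records: the second row repeats the first one exactly.
sleep_demo <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12")),
  total_minutes_asleep = c(420, 420, 390)
)

# duplicated() flags rows that repeat an earlier row; sum() counts them.
n_duplicates <- sum(duplicated(sleep_demo))

# distinct() keeps one copy of each unique row.
sleep_dedup <- distinct(sleep_demo)

n_duplicates
nrow(sleep_dedup)
```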

Finally, we merge all the data into one data frame and change the type of id from numeric to string, so that each user is treated as a category.

```{r}
dailyActivity_join <- dailyActivity_clean %>%
  left_join(sleepDay_clean, by = c("id", "date")) %>%
  left_join(., weightLogInfo_clean, by = c("id", "date")) 

#now we change the data type for the id column

dailyActivity_join$id <- as.character(dailyActivity_join$id)

head(dailyActivity_join)
```
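
A left join keeps every activity row and fills the sleep columns with `NA` when a user has no sleep record for that day, which is why later steps need `na.rm = TRUE` or `drop_na()`. A hedged sketch with made-up tables (`activity_demo` and `sleep_demo` are hypothetical):

```r
library(dplyr)

# Made-up daily activity: three user-days.
activity_demo <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_steps = c(8000, 300, 12000)
)

# Made-up sleep log: only one of those user-days has a record.
sleep_demo <- data.frame(
  id = 1,
  date = as.Date("2016-04-12"),
  total_minutes_asleep = 410
)

# All activity rows are kept; unmatched rows get NA in the sleep column.
joined_demo <- activity_demo %>%
  left_join(sleep_demo, by = c("id", "date"))

# Treat the user id as a label rather than a number.
joined_demo$id <- as.character(joined_demo$id)

joined_demo
```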

We are also going to use the data in *hourlyCalories_merged.csv*, *hourlyIntensities_merged.csv* and *hourlySteps_merged.csv*. We just check them for duplicates.

```{r}
sum(duplicated(hourlyCalories_merged.csv))

sum(duplicated(hourlyIntensities_merged.csv))

sum(duplicated(hourlySteps_merged.csv))
```

Now we are going to format the hours and also clean the names.

```{r}
#Each table is parsed from its own ActivityHour column (renamed to date_time)
hourlyCalories_clean <- hourlyCalories_merged.csv %>%
 clean_names() %>%
 rename(date_time = activity_hour)%>%
 mutate(date_time = mdy_hms(date_time))

hourlyIntensities_clean <- hourlyIntensities_merged.csv %>%
 clean_names() %>%
 rename(date_time = activity_hour)%>%
 mutate(date_time = mdy_hms(date_time))


hourlySteps_clean <- hourlySteps_merged.csv %>%
 clean_names() %>%
 rename(date_time = activity_hour)%>%
 mutate(date_time = mdy_hms(date_time))
```
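
As a quick illustration of what `mdy_hms()` does with 12-hour timestamps, the sketch below parses two illustrative strings (written in the month/day/year AM/PM style; they are examples, not rows copied from the files) and formats the date and the hour separately:

```r
library(lubridate)

# Illustrative timestamps: midnight and 1 PM on the same day.
raw_hours <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# mdy_hms() returns date-times in UTC and converts AM/PM to a 24-hour clock.
parsed_hours <- mdy_hms(raw_hours)

# An explicit format always prints the requested fields, including "00:00" at midnight.
format(parsed_hours, "%Y-%m-%d")
format(parsed_hours, "%H:%M")
```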

Since we did not find any duplicates, we merge all the data into a single data frame, *hourlyActivity_join*.

```{r}
hourlyActivity_join <- hourlyCalories_clean %>%
  inner_join(hourlyIntensities_clean, by = c("id", "date_time"))%>%
  inner_join(.,hourlySteps_clean, by = c("id", "date_time"))

#We also split date_time into a date and an hour-and-minute time, to make them easier to handle.
#format() is used because it keeps the "00:00" time of midnight records.
hourlyActivity_join <- hourlyActivity_join %>%
  mutate(date = format(date_time, "%Y-%m-%d"),
         time = format(date_time, "%H:%M")) %>%
  select(-date_time)

#now we change the data type for the id column
hourlyActivity_join$id <- as.character(hourlyActivity_join$id)

head(hourlyActivity_join)
```

Since we have already merged our main data frames, we can drop all the other objects from the R environment (for performance and cleanliness).

```{r}
#First we list all the objects we have, to see them
ls()
```


```{r}
#Now we drop all data frames except the ones we created and will use later.
rm(list=setdiff(ls(), c("dailyActivity_join", 'hourlyActivity_join', 'dailyActivity_clean', 'sleepDay_clean', 'weightLogInfo_clean', 'heartrate_seconds_merged.csv')))
```

Finally, since we are not going to use all the columns in **dailyActivity_join**, we can drop some of them (for performance and cleanliness).

```{r}
dailyActivity_join <- dailyActivity_join %>% 
  select(-c(total_distance,
            tracker_distance,
            logged_activities_distance, 
            very_active_distance, 
            moderately_active_distance, 
            light_active_distance,
            sedentary_active_distance, 
            total_sleep_records,  
            total_time_in_bed,  
            weight_kg,
            weight_pounds,
            fat,
            bmi,
            is_manual_report,
            log_id))
```


## Normality Analysis of the Data Frames

Here we investigate the normality of the numerical data, to learn more about the limitations of our data. Let's start with the variables inside *dailyActivity_join*:

```{r}
#Here we use the DataExplorer library, since our data frame has some categorical variables and it would be difficult to write a loop for ggplot2.
dailyActivity_join %>%
  plot_histogram( 
    ncol = 3,
    ggtheme = theme_light()
    )
```

We can see that some variables are close to normally distributed, with little skew or few abnormal values, e.g. *calories*, *total_minutes_asleep* and *lightly_active_minutes*, while others have strongly right-skewed distributions, e.g. *fairly_active_minutes* and *very_active_minutes*. 

Now we analyze the data inside *hourlyActivity_join*:

```{r}
hourlyActivity_join %>%
  plot_histogram( 
    ncol = 3,
    ggtheme = theme_light()
    )
```

Here we can see that all variables are right-skewed. This is related to the fact that during most hours people are working or sleeping, and since the intensity is low, it is normal for the calories to have a skewed distribution.

# Data Analysis

## Distribution of the tracking of the devices

We are ready to ask some questions of our data. The first question we want to investigate is:

* **How is the usage of the devices distributed across the different tracking functions?**

We already know the answer to this question thanks to the initial exploration we did. We have 33 users who use their device to track their daily activity, 24 users who track their sleep, 8 users who track their weight and 14 users who track their heart rate. So let's put this information in a plot.

```{r}
#We are going to plot a Venn diagram of the 4 data frames dailyActivity_clean, sleepDay_clean, weightLogInfo_clean and heartrate_seconds_merged.csv
#First we need to create the sets: for each data frame, a set of unique Ids.

step_ids <- unique(dailyActivity_clean$id, incomparables = FALSE)
sleep_ids <- unique(sleepDay_clean$id, incomparables = FALSE)
heartrate_ids <- unique(heartrate_seconds_merged.csv$Id, incomparables = FALSE)
weight_ids <- unique(weightLogInfo_clean$id, incomparables = FALSE)

#Now we create the graph. First we need a list of the sets.
x <- list(A=step_ids, B=sleep_ids, C=heartrate_ids, D=weight_ids)

#Function to display the Venn diagram inside markdown; for this we need the VennDiagram library
display_venn <- function(x, ...){
  grid.newpage()
  venn_object <- venn.diagram(x, filename = NULL, ...)
  grid.draw(venn_object)
}

#display Venn diagram
display_venn(
  x,
  category.names = c("Steps count", "Sleep monitor", "Heart monitor", "Weight tracking"),
  fill = c("#999999", "#E69F00", "#56B4E9", "#009E73")
  )
```
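
The overlaps shown in a Venn diagram can also be counted directly with base R set functions. The following is a small sketch on made-up ID vectors (the `_demo` names and letters are hypothetical, not real user Ids):

```r
# Made-up sets of user IDs for three tracking functions.
step_ids_demo   <- c("A", "B", "C", "D", "E")
sleep_ids_demo  <- c("A", "B", "C")
weight_ids_demo <- c("B", "E")

# Users who track both sleep and weight.
sleep_and_weight <- intersect(sleep_ids_demo, weight_ids_demo)

# Users who only count steps and use neither of the other functions.
steps_only <- setdiff(step_ids_demo, union(sleep_ids_demo, weight_ids_demo))

# Users present in all three sets.
all_three <- Reduce(intersect, list(step_ids_demo, sleep_ids_demo, weight_ids_demo))

length(sleep_and_weight)
length(steps_only)
length(all_three)
```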

## Type of users per activity level

Here we will ascertain how often the participants use their smart devices. Using the daily activity data, we will assume that days with 200 total steps or fewer are days on which users did not wear their devices. We will filter out these inactive days and assign the following designations:

* Low Use - 1 to 5 days 
* Moderate Use - 6 to 20 days 
* High Use - 21 to 31 days

Breaking down the analysis further in this way will help us understand the different trends underlying each usage group.

```{r}
#Here we create a table to classify the users according to the number of days they appear in the data frame
dailyActivity_join %>%
  filter(total_steps > 200) %>%
  group_by(id) %>%
  summarize(count = n()) %>%
  mutate(usage =  ifelse(count <= 5,  "Low use", 
                        ifelse(count <= 20,  "Moderate use", 
                        ifelse(count <= 31,  "High use", NA))))%>%

#We now create a percentage data frame to better visualize the results in the graph.
#The :: calls the function percent from the scales library; since we only use it once, we don't need to load the library.
  group_by(usage) %>%
  summarise(total = n()) %>%
  mutate(perc = total/sum(total))%>%
  mutate(perc = scales::percent(perc)) %>% 

#Now that we have our new table we can create our plot.
  ggplot(aes(x = "", y = total, fill = usage )) +
    geom_bar(stat='identity', width = 1) +
    coord_polar("y", start=0)+
    theme_void()+
    theme(plot.title = element_text(hjust = 0.5, vjust= -5, size = 20, face = "bold")) +
    geom_text(aes(label = perc, x = 1.25),position = position_stack(vjust = 0.5)) +
    labs(title = "Usage Group Distribution") +
    guides(fill = guide_legend(title = "Usage Type"))
```

Analyzing our results, we can see that 63.6% of the users in our sample use their device frequently, almost every day (between 25 and 31 days), 27.3% use their device 15 to 25 days, 6.1% use their device between 5 and 15 days, and 3.0% use their devices very rarely. 
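
The three usage bands used in the code can also be expressed with base R's `cut()`, which makes the interval boundaries explicit. This is only a sketch: `days_used` is a made-up vector of day counts chosen to sit on the band edges, not values from the data.

```r
# Made-up numbers of active days, chosen to sit on the band boundaries.
days_used <- c(1, 5, 6, 20, 21, 31)

# Intervals are right-closed by default: (0,5], (5,20], (20,31].
usage_demo <- cut(days_used,
                  breaks = c(0, 5, 20, 31),
                  labels = c("Low use", "Moderate use", "High use"))

# Count and share of users per band.
table(usage_demo)
prop.table(table(usage_demo))
```

Because the labels are given in order, the resulting factor already has the level order wanted for plotting.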

## Time used smart device and distribution 

We will analyze the steps taken by users within and between groups, per day and per hour. Let's start with the daily steps of each user in each group.

```{r}
#Here we add the classification we did before as a new column of our data frame, since we need it for the rest of the analysis.
dailyActivity_join <- dailyActivity_join %>%
  filter(total_steps > 200) %>%
  group_by(id) %>%
  mutate(count = n()) %>%
  mutate(usage =  ifelse(count <= 5,  "Low use",
                        ifelse(count <= 20,  "Moderate use",
                        ifelse(count <= 31,  "High use", NA)))) %>%

#We set the order of the levels as we want them to appear on the plots.
 mutate(usage = factor(usage, levels = c('Low use','Moderate use','High use'))) %>% 
  
#Since we grouped our main data frame, we need to ungroup it; otherwise later summaries would still be grouped by id.
 ungroup()
```
  
```{r}
dailyActivity_join %>%
  ggplot(aes(x = date, y = total_steps, group = id, color = id)) +
    geom_line() +
    theme(legend.position = "none")+
    facet_wrap(~usage, ncol = 1)
```

There is no specific trend here, since even some high use users have days with low total steps. Now we plot the average steps per day for each group.

```{r}
dailyActivity_join %>%
  group_by(usage, date) %>% 
  summarize(average_steps = mean(total_steps)) %>%
  ggplot(aes(x = date, y = average_steps, fill = usage,  color = usage)) + 
     geom_col()+
     facet_wrap(~usage)
```

Now we visualize this more clearly with a box plot.

```{r}
dailyActivity_join %>%
  group_by(usage, date) %>% 
  summarize(average_steps = mean(total_steps)) %>%
  ggplot(aes(x = usage, y = average_steps, fill = usage,  color = usage)) + 
     geom_boxplot()
```

Finally, we are going to plot the average use of the devices per week day.

```{r}
#First we create a column that contains the weekday
dailyActivity_join %>%
  mutate(weekday = weekdays(as.Date(date)), 
         weekday = fct_relevel(weekday, c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))) %>% 

#Now we group by usage and weekday, get the average and the 95% confidence interval (t distribution with n - 1 degrees of freedom), and finally we plot.
  group_by(weekday, usage) %>% 
  summarize(average_steps = mean(total_steps), ci = qt(0.975, df = n() - 1)*sd(total_steps)/sqrt(n()))%>%
  ggplot(aes(x = weekday, y = average_steps, fill = usage,  color = usage)) +
     geom_col()+
#Code to add the confidence intervals  
     #geom_errorbar(aes(ymin =  average_steps - ci, ymax =  average_steps + ci), width = 0.2, colour = 'black') +
     facet_wrap(~usage, ncol=1)
```

We can see some patterns in our data:

* Average steps per day increase as device usage increases; we investigate this further in the next section.

* For moderate and high use users, no single day shows a clearly higher mean than the other days (a t-test would be needed to confirm this, bearing in mind that the data are not independent within or between groups).

* The low use group (1 individual) does not seem to differ in its mean from the moderate use users.
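
The confidence interval computed in the weekday summary above follows the usual t-based formula, mean plus or minus the t quantile times the standard error. A minimal sketch on made-up step counts (`steps_demo` is hypothetical), checked against `t.test()`; like the t-test mentioned above, it assumes independent observations, which may not hold for repeated measurements of the same users:

```r
# Made-up daily step counts for one weekday.
steps_demo <- c(7200, 8100, 9500, 6400, 10300, 8800)

n <- length(steps_demo)
avg <- mean(steps_demo)

# Half-width of the 95% interval: t quantile with n - 1 degrees of freedom times the standard error.
half_width <- qt(0.975, df = n - 1) * sd(steps_demo) / sqrt(n)
ci_demo <- c(lower = avg - half_width, upper = avg + half_width)

# t.test() reports the same interval, which serves as a cross-check.
ci_ref <- t.test(steps_demo)$conf.int

ci_demo
ci_ref
```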

## Usage during the day (a more in-depth analysis)

Now that we have some usage trends, we want to look at how device usage is distributed during the day and how this correlates with some activities. For this we are going to work with the *hourlyActivity_join* table. The first thing we investigate is the distribution of device usage during each day of the week for each group. 

```{r}
#Since this is another data frame, we need to make the classification again. First we sum the steps of each user per day, so we can filter by values > 200 in the next step.
#We group by the full date, because the day of the month alone would merge 12 April and 12 May.
hourlyActivity_join <- hourlyActivity_join %>%
  mutate(day = day(date)) %>%
  group_by(id, date) %>%
  mutate(total_steps = sum(step_total)) %>% 
  ungroup()
```

```{r}
#Now we count the days each user used the device and make the categorization.
hourlyActivity_join <- hourlyActivity_join %>%
  filter(total_steps > 200) %>%
  group_by(id) %>% 
  mutate(days_usage = n_distinct(date)) %>% 
  mutate(usage =  ifelse(days_usage <= 5,  "Low use", 
                        ifelse(days_usage <= 20,  "Moderate use", 
                        ifelse(days_usage  <= 31,  "High use", NA)))) %>% 
#We set the order of the levels as we want them to appear on the plots.
  mutate(usage = factor(usage, levels = c('Low use','Moderate use','High use'))) %>% 
  ungroup()
```
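
When counting days of use, note that the day of the month does not uniquely identify a calendar day once the data spans two months. A minimal sketch, assuming the export covers 12 April to 12 May 2016 as the directory name suggests (`study_dates` is built here for illustration, not read from the data):

```r
# Every calendar day in the assumed study window.
study_dates <- seq(as.Date("2016-04-12"), as.Date("2016-05-12"), by = "day")

# Day of the month for each date (12 appears twice: 12 April and 12 May).
day_of_month <- as.integer(format(study_dates, "%d"))

# Distinct calendar days versus distinct day-of-month values.
n_dates <- length(unique(study_dates))
n_day_numbers <- length(unique(day_of_month))

n_dates
n_day_numbers
```

Counting distinct dates gives 31 days, while counting distinct day-of-month values gives 30, so a user active on both 12 April and 12 May would be credited with one day fewer.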

```{r}
#Now we plot
hourlyActivity_join  %>%
  mutate(weekday = format(ymd(date), format = '%a'), 
         weekday = fct_relevel(weekday, c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))) %>%
  group_by(time, weekday, usage) %>% 
  summarize(average_steps = mean(step_total)) %>% 
  ggplot(aes(x = time, y = average_steps, fill = average_steps)) +
    viridis::scale_fill_viridis(option = "D")+
    geom_col()+
    facet_grid(usage~weekday)+
    theme(axis.text.x = element_text(size = 5, angle = 90))
```

```{r}
#We also make a heat map of the same distribution, to have another option for presentation.
hourlyActivity_join %>%
  mutate(weekday = format(ymd(date), format = '%a'), 
         weekday = fct_relevel(weekday, c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))) %>%
  group_by(weekday,time, usage) %>% 
  summarize(average_steps = mean(step_total)) %>% 
  ggplot(aes(x = time, y = weekday, fill = average_steps)) +
    viridis::scale_fill_viridis(option = "D")+
    geom_tile()+
    geom_text(aes(label = round(average_steps, digits = 0)), color = "black", size = 2.0) +
    facet_wrap(~usage, ncol=1)+
    theme(axis.text.x = element_text(size = 5, angle = 90))
```

We can see some patterns in our data:

* High use users start their day an hour earlier (6:00 AM) than the other groups and end it an hour later (10:00 PM). On weekdays the peaks are between 5:00 and 8:00 PM, suggesting habitual exercise after work ends.

* Moderate use users display peaks in their steps on Saturdays and Sundays, between 8:00 AM and 12:00 PM.

## More specific questions about the data

As we saw in the last part, there are some hours where users show peaks. We want to investigate whether these are related to exercise sessions (at the gym we can just lift weights, or we can spend some time doing cardio on a treadmill). For this we are going to investigate the intensity variable and answer some questions:

* **What is the relationship between intensity and average steps?**

```{r}
#First we are going to plot the distribution of intensity between the days.
hourlyActivity_join %>%
  mutate(weekday = format(ymd(date), format = '%a'), 
         weekday = fct_relevel(weekday, c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))) %>%
  group_by(time, weekday, usage) %>% 
  summarize(average_intensity_hour = mean(average_intensity))%>% 
  ggplot(aes(x = time, y = average_intensity_hour, fill = average_intensity_hour)) +
    viridis::scale_fill_viridis(option = "inferno")+
    geom_col()+
    facet_grid(usage~weekday)+
    theme(axis.text.x = element_text(size = 5, angle = 90))
```

We can see that the plots of average_intensity and average_steps are very similar for each group, so we expect a linear correlation between these two variables.

```{r}
hourlyActivity_join %>%
  mutate(weekday = weekdays(as.Date(date))) %>%
  group_by(time, weekday, usage) %>% 
  summarize(average_intensity_hour = mean(average_intensity), average_steps = mean(step_total) )%>% 
  ggplot(aes(x = average_intensity_hour, y = average_steps)) +
    geom_point()+
    geom_smooth()+
    facet_wrap(~usage)
```

As expected, there is a positive, almost linear correlation between the average intensity per hour and the average steps per hour. We can therefore associate high step counts with exercise sessions in which the user is very active. Let's also investigate the correlation between these variables and the average calories burned.
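The strength of such a relationship can also be expressed as a number. The following is a minimal sketch in base R that computes the Pearson correlation coefficient and a simple linear fit. The data frame `toy` and its values are made up for illustration and are not taken from the FitBit data; in this analysis the same calls would be applied to the summarized `average_intensity_hour` and `average_steps` columns.

```r
# Toy hourly averages (hypothetical values, NOT taken from the FitBit data)
toy <- data.frame(
  average_intensity_hour = c(2, 5, 9, 14, 20, 27),
  average_steps          = c(60, 150, 280, 430, 610, 820)
)

# Pearson correlation coefficient between the two averages
r <- cor(toy$average_intensity_hour, toy$average_steps, method = "pearson")

# Simple linear fit: average steps as a function of average intensity
fit <- lm(average_steps ~ average_intensity_hour, data = toy)

print(round(r, 3))
print(coef(fit))
```

A coefficient close to 1 would support reading the scatter plot as an almost linear, positive relationship.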

```{r}
hourlyActivity_join %>%
  mutate(weekday = weekdays(as.Date(date))) %>%
  group_by(time, weekday, usage) %>% 
  summarize(average_intensity_hour = mean(average_intensity), average_steps = mean(step_total), average_calories = mean(calories)) %>% 
  GGally::ggpairs(columns = c(4,5,6))
```

We can see a strong correlation between these variables. This is expected since, as we saw before, high step counts are generally associated with high-intensity exercise sessions, where the user tends to burn more calories.

### What about sleep behavior?

Another interesting variable to analyze is sleep. We want to investigate how users sleep according to their activity level.

* **What is the relationship between activity level and sleep hours?**

```{r}
#First we are going to plot the distribution of sleep between the groups.
dailyActivity_join %>%
  group_by(date, usage) %>% 
  summarize(average_sleep_minutes = mean(total_minutes_asleep, na.rm=TRUE)) %>% 
  ggplot(aes(x = usage, y = average_sleep_minutes, fill = usage)) +
    geom_boxplot()
```

Here we see that users sleep about the same amount of time regardless of their usage group. The Low use group shows no value because the only user in this group has no sleep data. We are going to dig deeper by classifying the time slept:

* Bad sleep - slept 300 minutes or less
* Normal sleep - slept more than 300 and up to 480 minutes
* Over sleep - slept more than 480 minutes

```{r}
dailyActivity_join <- dailyActivity_join %>%
  mutate(sleep_type =  ifelse(total_minutes_asleep <= 300,  "Bad sleep", 
                       ifelse(total_minutes_asleep <= 480,  "Normal sleep", 
                       ifelse(total_minutes_asleep > 480,  "Over sleep", NA))), 
         sleep_type = factor(sleep_type, levels = c('Bad sleep','Normal sleep','Over sleep'))) 
```
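The same three classes can also be written with `cut()`, which makes the handling of the boundary values explicit. This is only a sketch with made-up values in the hypothetical vector `minutes_asleep`, not a replacement for the chunk above; with `right = TRUE`, exactly 300 minutes counts as bad sleep and exactly 480 minutes as normal sleep, which matches the `<=` comparisons used in the analysis.

```r
# Toy values (hypothetical), chosen to sit on and around the thresholds
minutes_asleep <- c(250, 300, 301, 480, 481, NA)

# Intervals are (-Inf, 300], (300, 480], (480, Inf); NA stays NA
sleep_type <- cut(
  minutes_asleep,
  breaks = c(-Inf, 300, 480, Inf),
  labels = c("Bad sleep", "Normal sleep", "Over sleep"),
  right  = TRUE
)

print(data.frame(minutes_asleep, sleep_type))
```

`cut()` returns a factor whose levels are already in the order given by `labels`, so no separate `factor()` call is needed.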

```{r}
dailyActivity_join %>%
  group_by(sleep_type, id) %>%
  summarize(count_sleep = n()) %>%
  drop_na() %>%
  summarize(total_sleep_type = n())  %>% 
  mutate(perc = total_sleep_type/sum(total_sleep_type))%>%
  mutate(perc = scales::percent(perc)) %>% 
  ggplot(aes(x = "", y = total_sleep_type, fill = sleep_type)) +
    geom_bar(stat='identity', width = 1) +
    coord_polar("y", start=0)+
    theme_void()+
    theme(plot.title = element_text(hjust = 0.5, vjust= -5, size = 20, face = "bold")) +
    geom_text(aes(label = perc, x = 1.2),position = position_stack(vjust = 0.5)) +
    labs(title = "Sleep Type Distribution") +
    guides(fill = guide_legend(title = "Sleep Type"))
```

```{r}
#We can also visualize this distribution across the different usage groups.
dailyActivity_join %>%
  group_by(usage, sleep_type, id) %>%
  summarize(count_sleep = n()) %>%
  drop_na() %>%
  summarize(total_sleep_type = n())  %>% 
  mutate(perc = total_sleep_type/sum(total_sleep_type))%>%
  mutate(perc = scales::percent(perc)) %>%
  ggplot(aes(x = "", y = total_sleep_type, fill = sleep_type)) +
    geom_bar(stat='identity', width = 1, position = "fill") +
    coord_polar("y", start=0)+
    theme_void()+
    theme(plot.title = element_text(hjust = 0.5, vjust= 5, size = 20, face = "bold")) +
    geom_text(aes(label = perc, x=1.2), position = position_fill(vjust = 0.5)) +
    labs(title = "Sleep Type Distribution") +
    guides(fill = guide_legend(title = "Sleep Type"))+
    facet_wrap(~usage, strip.position = "bottom")
```

Analyzing our results, we can see that 26.4% of the time users report bad sleep, 35.8% of the time they have normal sleep and 35.8% of the time they oversleep. Across the usage groups, the distribution is close to the global one. Note that a user can have days in each category.

Finally, we are going to relate sleep behavior to activity level, classifying our users according to their mean activity level. This classification compares the average activity level of each user with the average activity level of all users, i.e. if a user has a sedentary average greater than the global sedentary average, this user is classified as sedentary. If a user does not fall into any category, we exclude that user from the data.

```{r}
#We need to classify the users' activity level. First we get the average of all users.
#We turn the 0 values into NA so they are ignored in the mean calculation.
#na_if() no longer accepts a whole data frame in dplyr >= 1.1.0, so we apply it column by column.
temp <- dailyActivity_join %>%
  mutate(across(where(is.numeric), ~ na_if(.x, 0))) %>% 
  mutate(sedentary_minutes_avg = mean(sedentary_minutes, na.rm = TRUE), 
            lightly_active_minutes_avg = mean(lightly_active_minutes, na.rm = TRUE),
            fairly_active_minutes_avg = mean(fairly_active_minutes, na.rm = TRUE),
            very_active_minutes_avg = mean(very_active_minutes, na.rm = TRUE)) %>% 

#We replace the NA values with 0 to avoid errors in our categorization. Then we make the classification using case_when.
  mutate(sedentary_minutes = replace(sedentary_minutes,is.na(sedentary_minutes),0),
        lightly_active_minutes = replace(lightly_active_minutes,is.na(lightly_active_minutes),0),
        fairly_active_minutes = replace(fairly_active_minutes,is.na(fairly_active_minutes),0),
        very_active_minutes = replace( very_active_minutes,is.na( very_active_minutes),0)) %>% 
  mutate(active_type = factor(case_when(sedentary_minutes > sedentary_minutes_avg &
                                 lightly_active_minutes <  lightly_active_minutes_avg &
                                 fairly_active_minutes< fairly_active_minutes_avg &
                                 very_active_minutes < very_active_minutes_avg ~ "Sedentary",
                                 lightly_active_minutes >  lightly_active_minutes_avg &
                                 fairly_active_minutes < fairly_active_minutes_avg &
                                 very_active_minutes < very_active_minutes_avg ~ "Lightly Active",
                                 fairly_active_minutes > fairly_active_minutes_avg &
                                 very_active_minutes < very_active_minutes_avg ~ 'Fairly Active',
                                 very_active_minutes > very_active_minutes_avg ~ 'Very Active'), 
                      levels=c("Sedentary", "Lightly Active", "Fairly Active", "Very Active")))%>%
  drop_na(sleep_type, active_type) 
  
#finally we plot.

temp %>% 
    ggplot(aes(x = active_type, fill = sleep_type)) +
    geom_bar(position = "fill") +
    labs(y = "Proportion")

temp %>% 
    ggplot(aes(x = active_type, fill = sleep_type)) +
    geom_bar(position = "fill") +
    labs(y = "Proportion")+
    facet_wrap(~usage) 
```
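The chunk above compares each daily record with the global averages. A per-user variant, closer to the verbal description of comparing each user's own average with the global average, could look like the sketch below. It uses base R and toy data. The data frame `daily`, its values and the simplified three-rule classification are assumptions made for illustration; the fairly active category and the zero handling are left out to keep it short.

```r
# Toy daily records (hypothetical values): three users, two days each
daily <- data.frame(
  id                     = c("A", "A", "B", "B", "C", "C"),
  sedentary_minutes      = c(1100, 1000, 700, 800, 600, 650),
  lightly_active_minutes = c(100, 120, 300, 280, 150, 170),
  very_active_minutes    = c(0, 5, 10, 8, 60, 70)
)

# Average of every activity column per user
per_user <- aggregate(. ~ id, data = daily, FUN = mean)

# Global averages across all daily records
global <- colMeans(daily[, -1])

# Check the rules from most to least active, so the first match wins;
# users matching no rule get NA and could be dropped afterwards
per_user$active_type <- with(per_user, ifelse(
  very_active_minutes > global[["very_active_minutes"]], "Very Active",
  ifelse(lightly_active_minutes > global[["lightly_active_minutes"]], "Lightly Active",
  ifelse(sedentary_minutes > global[["sedentary_minutes"]], "Sedentary", NA))))

print(per_user[, c("id", "active_type")])
```

With one label per user instead of one per day, each user contributes to a single activity category in the plots.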

Analyzing our results, we can see that sedentary people tend to sleep badly. We can also observe that a little activity during the day tends to go with normal sleep. Also, as the activity level increases, oversleeping decreases.

# Discussion

The FitBit data set confirms that not all users fully utilize the functions of their devices/trackers. All 33 unique IDs used the step count function. 24/33 unique IDs used the sleep tracking function. 14/33 unique IDs used the heart-rate tracking. 8/33 unique IDs used their devices to track their weight.

**High Use Group**

This group consists of 24 users, or 73% of the total sample size, and wears the device regularly, between 21 and 31 days. This is the most active group, and also the most varied in the types of exercise carried out, ranging from light to fairly and very active forms of exercise. They tend to be active throughout the week, with an average of weekly steps of 9054.5. Their sleep behavior is evenly distributed between bad, normal and over sleep.

**Moderate Use Group**

This group consists of 8 users, or 24% of the total sample size, and wears the device between 6 and 20 days. Users in this group are less active and walk fewer steps than the ‘High Use’ group on weekdays, but they are active during the weekends, between 08:00 AM and 1:00 PM. While significantly less active than the ‘High Use’ group, they also stick to their routine, and their sleep behavior is likewise evenly distributed between bad, normal and over sleep.


**Low Use Group**

This group consists of only 1 user, or 3% of the sample size, which is too small to provide any meaningful analysis. Most of the trends are skewed away from any recognizable pattern. With this in mind, this group displays behavior similar to that of the moderate users group. No sleep behavior was registered for this user.

# Final remarks

Bellabeat's mission is to empower women by providing them with the data to discover themselves.

To respond to our business task and help Bellabeat with its mission, based on our results, I would advise using Bellabeat's own tracking data for further analysis. The datasets used have a small sample and may be biased, since we did not have any demographic details about the users. Knowing that our main target is young and adult women, I would encourage Bellabeat to keep looking for trends in order to create a marketing strategy focused on them.

That being said, our analysis found several trends that may help our online campaign and improve the Bellabeat app:

**1. Daily notification for exercise:** We classified users into 3 categories and saw that, on average, users sleep less than 8 hours a day. However, we saw better sleep habits when users increase their activity level. We can encourage customers to reach at least the daily recommended number of steps by sending them alerts if they have not reached it, and by publishing posts on our app that explain the benefits of reaching that goal.

**2. Notification and sleep techniques:** To reduce bad sleep and oversleeping, users could set a desired bedtime and receive a notification a few minutes before it to prepare for sleep. We can also offer helpful resources to help customers sleep, e.g. breathing advice, podcasts with relaxing music and sleep techniques.

**3. Technical support:** Based on the distributions, we found that many users don't use all the functionality of their devices. Bellabeat can offer helpful resources and reminders to help customers configure their devices and get all the benefits from their purchase.

```{r}
#Export the data to a comma-separated file that Excel can open, for some manual verification.
write.csv(x = dailyActivity_join, file = "my_file.csv", row.names = FALSE)
```

