Laundry Statistics

Part 1 – Background

The apartment building I live in offers a service called “LaundryView”. Each floor of the building has an entry on the LaundryView website [1] that displays the current status of the machines on that floor. Here’s a screenshot from one of the floors in my building.

A sample of the LaundryView website

As you can see, washers and dryers can be available, in use, or unavailable (in need of repairs). With a clear, reliable source of data, I thought it might be interesting to log the status of the machines in my building and do some analysis later on.

Part 2 – Logging

I started out by writing a program in Java to log the status of the machines in my building every 5 minutes. It uses JSoup [2] to download the webpages for all the floors in my building. I used JSoup’s built-in methods to parse the HTML and find the table element that contains the washer and dryer data. The current status of each machine is determined from its icon image.

Once I found a good way to find the statuses, the rest of the work was pretty simple. I wrote a Machine class with basic attributes (status, machine id, and additional information). New Machines are instantiated and stored in a LinkedList for each floor. After all the machines are parsed, the machine data is printed to STDOUT in CSV format.

Picking a good CSV format for logging was tricky. I had trouble weighing the benefits of including more information against the downsides of a larger file size. In the end, I went with a fairly minimalistic approach. The CSV format I settled on was “date/time, floor #, W1, W2, W3, D1, D2, D3”, with W1 representing the state of washer 1, D2 representing the state of dryer 2, etc. The three states (available, in use, and unavailable) are represented by A, I, and U respectively. Some floors only have 2 washers and 2 dryers, however. In that case, I opted to store a 0 in the W3 and D3 places. Thus, a sample from the CSV file looks like this…

2018-01-02 16:10,14,A,A,A,A,A,A
2018-01-02 16:15,4,A,A,0,A,A,0
2018-01-02 16:15,5,A,A,0,A,A,0
2018-01-02 16:15,6,A,A,0,A,A,0
2018-01-02 16:15,7,A,A,0,I,U,0
2018-01-02 16:15,8,A,A,0,A,A,0
2018-01-02 16:15,9,A,A,0,I,I,0
2018-01-02 16:15,10,A,A,0,A,A,0
2018-01-02 16:15,11,A,A,I,A,A,A
2018-01-02 16:15,12,A,A,0,A,A,0
2018-01-02 16:15,14,A,A,A,A,A,A
2018-01-02 16:20,4,A,A,0,A,A,0
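
To make the row layout concrete, here is a minimal Java sketch of how one such row could be assembled. It is not the original logger code: the class name LaundryCsv and the method formatRow are hypothetical, and it assumes each floor's washer and dryer states are already available as lists of the characters A, I, and U.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;

// Hypothetical helper for illustration; names are not taken from the original logger.
class LaundryCsv {
    // Timestamp layout matching the sample rows, e.g. "2018-01-02 16:15".
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");

    // Every row carries three washer fields (W1..W3) and three dryer fields (D1..D3).
    private static final int SLOTS = 3;

    // Builds one row in the form "date/time,floor,W1,W2,W3,D1,D2,D3".
    static String formatRow(LocalDateTime time, int floor,
                            List<Character> washers, List<Character> dryers) {
        StringBuilder row = new StringBuilder(STAMP.format(time));
        row.append(',').append(floor);
        appendPadded(row, washers);
        appendPadded(row, dryers);
        return row.toString();
    }

    // Writes exactly SLOTS fields, using '0' for machines the floor does not have.
    private static void appendPadded(StringBuilder row, List<Character> states) {
        for (int i = 0; i < SLOTS; i++) {
            row.append(',').append(i < states.size() ? states.get(i) : '0');
        }
    }
}
```

Calling formatRow for floor 7 at 16:15 on 2018-01-02 with washers A, A and dryers I, U reproduces the floor 7 row from the sample above, with 0 in the W3 and D3 places.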

Working as a Linux systems admin for Madison’s College of Engineering, I knew that we offered a user crontab service on our Linux lab machines. I wrote a Bash wrapper that runs the Java logger and appends its output to a file named <date>.csv. Then I made a crontab entry to run the wrapper every 5 minutes. The setup has been running for 9 days and all has gone well so far.

Part 3 – Analysis

Part of the reason I wanted to do this laundry project was to learn R. I was inspired by a blog post I saw on Reddit which used R to analyze the effectiveness of different water filters [3]. Until a week ago I knew nothing about R, so learning how to analyze real data I had collected myself was exciting and interesting.

I started out by finding a good way to get the CSV files into R. So far I’ve been manually using scp to copy the files from the engineering network to a directory on my local machine. The R script then reads the list of files in that directory and uses read.csv to convert them into tables. I then combine all of the tables into one large table which contains all the data from all the days I’ve been collecting.
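
The actual script does this step in R with read.csv. Purely as an illustration of the same idea (list the files in a directory, read each one, and stack the rows into one table), here is a rough Java sketch. The class name CsvLoader and the method loadAll are made up for this example, and it assumes every log file ends in .csv and has no header row.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical Java counterpart to the R loading step; not the author's code.
class CsvLoader {
    // Reads every *.csv file in dir (in file-name order) and returns all rows,
    // each split into its comma-separated fields.
    static List<String[]> loadAll(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> entries = Files.list(dir)) {
            files = entries
                    .filter(p -> p.getFileName().toString().endsWith(".csv"))
                    .sorted()
                    .collect(Collectors.toList());
        }

        List<String[]> rows = new ArrayList<>();
        for (Path file : files) {
            for (String line : Files.readAllLines(file)) {
                // Skip blank lines so they do not turn into empty rows.
                if (!line.trim().isEmpty()) {
                    rows.add(line.split(","));
                }
            }
        }
        return rows;
    }
}
```

Because the files are named by date, sorting by file name keeps the combined rows in chronological order.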

Next, I wrote a function which finds the usage, or load, percentage for one row of the table (one floor at one specific time). It calculates the load of the washers, the load of the dryers, and the overall load and saves those values into a new column on the table. This function is run on all of the rows of the table.
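
The per-row calculation can be sketched as follows. This is a Java illustration, not the R function from the script, and the names LoadCalc, load, and rowLoads are hypothetical. It also makes an assumption the post does not spell out: a 0 placeholder is ignored entirely, while an unavailable (U) machine still counts as an installed machine that is not in use.

```java
import java.util.Arrays;

// Hypothetical Java rendering of the per-row load calculation.
class LoadCalc {
    // Percentage of installed machines that are in use ("I").
    // Assumption: "0" placeholders are skipped; "U" machines count as installed.
    static double load(String... states) {
        int installed = 0;
        int inUse = 0;
        for (String state : states) {
            if (state.equals("0")) {
                continue; // this floor has no machine in this slot
            }
            installed++;
            if (state.equals("I")) {
                inUse++;
            }
        }
        return installed == 0 ? 0.0 : 100.0 * inUse / installed;
    }

    // Returns {washer load, dryer load, overall load} for one CSV row of the
    // form "date/time,floor,W1,W2,W3,D1,D2,D3".
    static double[] rowLoads(String csvRow) {
        String[] fields = csvRow.split(",");
        String[] washers = Arrays.copyOfRange(fields, 2, 5);
        String[] dryers = Arrays.copyOfRange(fields, 5, 8);
        String[] all = Arrays.copyOfRange(fields, 2, 8);
        return new double[] { load(washers), load(dryers), load(all) };
    }
}
```

Under these assumptions, the sample row for floor 7 (A,A,0,I,U,0) gives a washer load of 0%, a dryer load of 50%, and an overall load of 25%.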

I knew before starting to write the R code that it was really easy to create graphs in R, and I was hoping that a graph of load vs time would be easy enough to make. I figured that I would start simple, only analyzing one floor at a time. Using the ggplot2 library, I was able to create a graph of the washer and dryer loads for a given floor, and the results were… well… as expected.

Washer and dryer use for a single floor

You can see that when the washer load rises for 50 minutes, the dryer load rises for 50 minutes shortly afterward. Further, if the washer load was at 50%, the dryer load is normally at 50% afterward. This makes intuitive sense: if someone uses one washer, they’re normally going to use only one dryer afterward. If they use all the washers, they’re normally going to use all the dryers. Nothing super revealing so far, but still a nice introduction to creating graphs and doing data visualization.

Next, I wanted to graph the total average load vs. time, that is, the percentage of all the machines in the building being used vs. time. This was tricky for me because I needed to find a way to combine the data for all 9 floors with the same timestamp. I ended up using a combination of the melt and cast functions, with the arguments for cast being “time~variable” and “mean”. From what I understand, this cast my table so that all rows with the same time entry merged into one row, using the mean of their load percentages as the total average load. This is still the part of my script that I’m most confused about, so I’m hoping to get better at manipulating data over time. Nonetheless, my method worked well enough to get me a total average load for every 5-minute period. I started by graphing the total average load vs. time and found something very interesting…
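
For readers who find the melt and cast step opaque, the underlying operation is a group-by on the timestamp followed by a mean. The Java sketch below shows that operation on its own. It is not a translation of the R code, and the names TotalLoad, Sample, and averageByTime are invented for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration of "merge rows with the same time, taking the mean".
class TotalLoad {
    // One floor's overall load (in percent) at one timestamp.
    static final class Sample {
        final String time;
        final int floor;
        final double load;

        Sample(String time, int floor, double load) {
            this.time = time;
            this.floor = floor;
            this.load = load;
        }
    }

    // Groups samples by timestamp and averages their loads. A TreeMap keeps the
    // keys sorted; "yyyy-MM-dd HH:mm" strings sort chronologically.
    static Map<String, Double> averageByTime(List<Sample> samples) {
        return samples.stream().collect(Collectors.groupingBy(
                s -> s.time,
                TreeMap::new,
                Collectors.averagingDouble(s -> s.load)));
    }
}
```

Each key in the result is one 5-minute timestamp, and its value is the mean of the per-floor loads recorded at that time. Note that a mean of per-floor percentages weights every floor equally, even though some floors have more machines than others.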

The total average load never reached 0! This meant that for the past week at least one machine was always running… or did it? I was skeptical of the result, mostly because no college kid I know wants to do laundry at 3 am on a Monday night. To troubleshoot, I ran the same single-floor analysis described above on each of the floors and ended up finding the problem. The status of the dryers on floor 9 seems to be flip-flopped.

My hypothesis so far is that the dryers are listed as in use when they are free and as available when they are in use. I’ll have to confirm this when I get back to Madison and put in a maintenance request to get it fixed if it ends up being the case.

I removed all of the data from floor 9 and, sure enough, the graph shifted downward, now reaching 0% use at some points.

The change from removing floor 9
Total average laundry use (except floor 9), with a regression line using ggplot2’s geom_smooth

I’m interested to see the large spike in laundry usage after Christmas break. Not only would this help show when the peak usage points are, but I’m hoping it will lead to more discoveries and conclusions in general.

Lastly, I threw the R script and the log data up to this point into a GitHub repo [4] in case anyone wants to see the code. I still have a lot to learn about R, most of which probably involves efficiency. I’m pretty sure, for instance, that my calcLoad function is needlessly lengthy and could be replaced with a well-crafted inline function. That’s something I’m hoping to improve on in the future.




