When I encounter a new data set that I think may be useful to our clients, I have to evaluate the data and clean it up (a process we call sanitization) before passing it along to the smart people at IdeaSpot that will use the data to create more valuable products for our clients.
One of my first steps is creating a few visualizations of the data. I do this for a few reasons:
I. Visualization allows you to easily see large outliers that may indicate errors in how the data was initially collected. You can find outliers by sorting the data using Excel or SQL, but you have to sort it in many different ways to check for outliers. a quick visualization will save you a lot of time.
II. If the data is regularly reported, for instance on a monthly basis, visualizing multiple month’s results allow you to see changes over time without having to construct complex growth formulas. If the data turns out to be valuable, we will construct the growth formulas, but creating them now could turn out to be wasted time if the data turns out to be worthless.
III. Pictures are pretty and I like looking at them. Numbers, letters, and symbols stuffed into grid format are boring and unattractive.
Now lets turn to an example. I found hotel tax receipt data for the state of Texas in .csv format here (I used the data from February 2016):
http://www.texastransparency.org/Data_Center/Search_Datasets.php
My results are stored here if you would like to view them while I go through my visualization process:
https://public.tableau.com/profile/jason.walters#!/vizhome/VuisualizingOutliers/Outliers
I downloaded the file and loaded it into Tableau Public, a free data visualization tool. I then created bar charts showing total room capacity and taxable hotel receipts by county. You can view the bar charts by clicking on the ‘BarChart’ tab in the upper left of the screen. I next created a ‘ReceiptMap’ and ‘CapacityMap’ that respectively display the total taxable hotel receipts and number of hotel rooms available for every zip code in the State of Texas. So far, nothing is really jumping out as a problem with the data. The zip codes with high room capacity also have higher receipts.
The next step I took was to combine the capacity and receipts maps to see what the average cost of a room is in each zip code. It is the ‘ReceiptsPerRoomMap’ tab. Here we start to see a little trouble with the data. There should be variation across the state for the average revenue a hotel room generates, dense cities have more expensive real estate so hotels have to generate more revenue per room to stay in business. However, within a city, the revenues per hotel room should be fairly similar. If you look at Dallas, Houston, and San Antonio, there are some extremely large circles right next next to extremely small circles and this warrants further investigation.
I then created a packed bubble diagram (‘Outliers’ tab) that represents every zip code in the state as a bubble where the size of the bubble represents the average receipts generated by a hotel room in the zip code. Clicking on the largest bubble, zip code (76092) shows that the zip code reported $1.1 million in receipts with an average receipt per room of $1.1 million. That must be one amazing hotel room!
Digging back into the original file we downloaded from the state, you can see that the hotel in 76092 has only has one room listed. The hotel is a large Hilton in Southlake, Texas with 248 rooms and 48 suites. In fact, most of the very large circle have the exact same problem. One or more of the hotels in the largest bubbles/zip codes are erroneously listed as having only one room for an entire hotel.
Large datasets often have many problems like this that need to fixed before any analysis can be performed. If you do not carefully inspect and clean up the data, the results of your analysis will be worthless. In future posts, I will cover additional techniques for recognizing and resolving problems with publicly aggregated data.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Non eget pharetra nibh mi, neque, purus.