An analysis of the inspection data of the restaurants in the 88 cities of Los Angeles county. The project was done as part of INFORMS Analytics Challenge at the University of Texas at Dallas. The entire summary of the project can be found in the project report.
- General info
- Screenshots
- Technologies and Tools
- Setup
- Code Examples
- Features
- Status
- Inspiration
- Contact
The data provided was published by the city of Los Angeles on Environment Health inspection and enforcement results from restaurants in the Los Angeles county. These data cover 85 of 88 cities and all unincorporated areas in the LA county. We have analyzed the data to answer 5 questions asked in the competition. We have also made a model to predict the health grade of the restaurant using only its address.
The entire presentation of the project can be found here.
- Python
- Tableau
- Microsoft Excel
There are two datasets available: (i) market inspection dataset: contains results of inspection; (ii) market violations dataset: contains information on health code violations in restaurants. The data was sourced in February 2019 and has data till January 16, 2019. Data can be downloaded from inspection data and violations data.
The code can be used to replicate the results. The tableau visualizations can be found here.
Some examples of usage:
Naive Bayes Classification
def train(self,dataset,labels):
self.examples=dataset
self.labels=labels
self.bow_dicts=np.array([defaultdict(lambda:0) for index in range(self.classes.shape[0])])
#only convert to numpy arrays if initially not passed as numpy arrays - else its a useless recomputation
if not isinstance(self.examples,np.ndarray): self.examples=np.array(self.examples)
if not isinstance(self.labels,np.ndarray): self.labels=np.array(self.labels)
#constructing BoW for each category
for cat_index,cat in enumerate(self.classes):
all_cat_examples=self.examples[self.labels==cat] #filter all examples of category == cat
#get examples preprocessed
cleaned_examples=[preprocess_string(cat_example) for cat_example in all_cat_examples]
cleaned_examples=pd.DataFrame(data=cleaned_examples)
#now costruct BoW of this particular category
np.apply_along_axis(self.addToBow,1,cleaned_examples,cat_index)
prob_classes=np.empty(self.classes.shape[0])
all_words=[]
cat_word_counts=np.empty(self.classes.shape[0])
for cat_index,cat in enumerate(self.classes):
#Calculating prior probability p(c) for each class
prob_classes[cat_index]=np.sum(self.labels==cat)/float(self.labels.shape[0])
#Calculating total counts of all the words of each class
count=list(self.bow_dicts[cat_index].values())
cat_word_counts[cat_index]=np.sum(np.array(list(self.bow_dicts[cat_index].values())))+1 # |v| is remaining to be added
#get all words of this category
all_words+=self.bow_dicts[cat_index].keys()
#combine all words of every category & make them unique to get vocabulary -V- of entire training set
self.vocab=np.unique(np.array(all_words))
self.vocab_length=self.vocab.shape[0]
#computing denominator value
denoms=np.array([cat_word_counts[cat_index]+self.vocab_length+1 for cat_index,cat in enumerate(self.classes)])
self.cats_info=[(self.bow_dicts[cat_index],prob_classes[cat_index],denoms[cat_index]) for cat_index,cat in enumerate(self.classes)]
self.cats_info=np.array(self.cats_info)
def test(self,test_set):
predictions=[] #to store prediction of each test example
for example in test_set:
#preprocess the test example the same way we did for training set exampels
cleaned_example=preprocess_string(example)
#simply get the posterior probability of every example
post_prob=self.getExampleProb(cleaned_example) #get prob of this example for both classes
#simply pick the max value and map against self.classes!
predictions.append(self.classes[np.argmax(post_prob)])
return np.array(predictions)
nb=NaiveBayes(np.unique(train_labels)) #instantiate a NB class object
print ("---------------- Training In Progress --------------------")
nb.train(train_data,train_labels) #start tarining by calling the train function
print ('----------------- Training Completed ---------------------')
pclasses=nb.test(test_data) #get predcitions for test set
#check how many predcitions actually match original test labels
test_acc=np.sum(pclasses==test_labels)/float(test_labels.shape[0])
print ("Test Set Examples: ",test_labels.shape[0]) # Outputs : Test Set Examples: 1502
print ("Test Set Accuracy: ",test_acc*100,"%") # Outputs : Test Set Accuracy: 93.8748335553 %
We have tried to answer the following questions in our analysis:
- What are the key factors in predicting health “scores” of the restaurants in Los Angeles county?
- What are the most important factors in classifying restaurants into different “grades”?
- Are there any relationships between various types of health code violations and scores/grades of a restaurant?
- Are there any patterns in terms of how health scores of restaurants change over time?
Project is: finished. Our team was the winner of the INFORMS Analytics Challenge 2019. Our college, The University of Texas at Dallas has published an article detailing the account of the competitions win by our team "Linear Digressors".
The cover photo of the Presentation template is to replicate the Los-Angeles skyline. Los Angeles skyline silhouette design | designed by Vexels
Created by me and my teammate Siddharth Oza, and Harsh Gupta.
If you loved what you read here and feel like we can collaborate to produce some exciting stuff, or if you just want to shoot a question, please feel free to connect with me on LinkedIn