Thursday, May 1, 2025

Classification of Airline Customer Data


Introduction

The task under consideration is the classification of airline customer data to predict satisfaction. The data consists of a range of attributes: continuous variables such as departure delay and flight distance, ordinal survey ratings on a 0 (or 1) to 5 scale, and categorical data such as gender and whether the customer holds a loyalty card. Unlike in other classification problems such as fraud detection or loan default prediction, the target is reasonably balanced, split 44% ‘satisfied’ and 56% ‘neutral or unsatisfied’. A few of the attributes show a significant number of missing values – nearly 30% in some cases. Here, values were imputed using the mean or median, or dealt with in other ways.

Two machine learning techniques are commonly cited in the literature as appropriate for this kind of classification task: naïve Bayes and random forest, and these are the subject of this comparison. Naïve Bayes is a simple probabilistic algorithm that applies Bayes’ theorem under an assumption of feature independence. A random forest is a collection of decision trees, each trained on a random subset of the data, with the final classification decided by majority vote; this is also termed an ‘ensemble’ technique. Both are simple to implement and robust in that they are not regarded as prone to overfitting. Despite their simplicity, both have been reported as “surprisingly accurate” in use.
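The post itself contains no code, so purely as an illustration, a rough scikit-learn equivalent of the preprocessing and the two base models might look like the sketch below. The file name and column names (‘airline_satisfaction.csv’, ‘satisfaction’, ‘Arrival delay in minutes’) are assumptions, not taken from the post.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Assumed file and column names -- the post does not specify them.
    df = pd.read_csv("airline_satisfaction.csv")

    # Impute missing values: median for the skewed delay attribute, mean
    # for the remaining numeric columns (the post notes nearly 30%
    # missing values in some attributes).
    delay = "Arrival delay in minutes"
    df[delay] = df[delay].fillna(df[delay].median())
    df = df.fillna(df.mean(numeric_only=True))

    # One-hot encode the categorical attributes (gender, loyalty, class, ...).
    X = pd.get_dummies(df.drop(columns=["satisfaction"]))
    y = (df["satisfaction"] == "satisfied").astype(int)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)

    for name, model in [("Random forest", RandomForestClassifier(random_state=42)),
                        ("Naive Bayes", GaussianNB())]:
        model.fit(X_train, y_train)
        f1 = f1_score(y_test, model.predict(X_test))
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"{name}: F-score {f1:.3f}, AUC {auc:.3f}")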

Results

Feature selection was undertaken by successively removing attributes with the column filter, one column at a time. Table 1 shows the scores after each removal. A fall of more than 0.002 against the ‘None’ baseline (the margin allows for rounding differences) means the model is negatively affected, i.e. the feature contributes to the model; conversely, a rise of more than 0.002 means the model gets better without the feature.
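As a hedged sketch of this leave-one-feature-out procedure (reusing the X, y, train/test split and scoring from the earlier snippet, and showing only the random forest side), the loop might look like the following; the feature names listed are assumptions matching the attribute list:

    # Retrain with each attribute removed and compare against the
    # all-features baseline (the 'None' row in Table 1). startswith()
    # also catches one-hot encoded columns such as 'Gender_Male'.
    def scores_without(feature):
        cols = [c for c in X.columns if not c.startswith(feature)]
        model = RandomForestClassifier(random_state=42)
        model.fit(X_train[cols], y_train)
        f1 = f1_score(y_test, model.predict(X_test[cols]))
        auc = roc_auc_score(y_test, model.predict_proba(X_test[cols])[:, 1])
        return f1, auc

    for feature in ["Gender", "Age", "Type of Travel", "Seat comfort"]:
        f1, auc = scores_without(feature)
        print(f"Without {feature}: F-score {f1:.3f}, AUC {auc:.3f}")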

With the random forest, most of the features make a contribution. For naïve Bayes, there are a number of features whose removal improves the model. 

Table 1: Feature selection - effect on model scores

                                     Random Forest         Naïve Bayes
Filtered Out                         F-score    AUC        F-score    AUC
None                                 0.892      0.968      0.778      0.881
Gender                               0.887      0.967      0.786      0.882
Customer loyalty                     0.871      0.958      0.785      0.878
Age                                  0.889      0.968      0.784      0.882
Type of Travel                       0.859      0.955      0.756      0.866
Class                                0.886      0.967      0.771      0.875
Online check-in                      0.888      0.964      0.780      0.876
Flight Distance                      0.890      0.967      0.775      0.882
Departure/Arrival time convenient    0.889      0.966      0.784      0.882
Ease of Online booking               0.873      0.948      0.779      0.875
Gate location                        0.873      0.948      0.785      0.882
Food and drink                       0.886      0.967      0.787      0.884
Seat comfort                         0.882      0.964      0.777      0.883
Inflight entertainment               0.890      0.967      0.786      0.888
On-board service                     0.888      0.967      0.783      0.882
Leg room service                     0.889      0.966      0.788      0.880
Baggage handling                     0.890      0.965      0.781      0.884
Checkin service                      0.884      0.964      0.792      0.879
Inflight service                     0.886      0.966      0.784      0.883
Cleanliness                          0.885      0.964      0.782      0.883
Departure delay in minutes           0.890      0.968      0.795      0.882
Arrival delay in minutes             0.892      0.967      0.793      0.882


With these set-ups (tuned parameters and selected features), the random forest model showed a slight decline in both F-score and AUC, while the naïve Bayes model improved; see Table 2.

Table 2: Results summary before and after tuning and feature selection

                                     Random Forest         Naïve Bayes
                                     F-score    AUC        F-score    AUC
Base model                           0.892      0.968      0.778      0.881
Tuned parameters and features        0.886      0.966      0.789      0.887
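The post does not record which random forest parameters were tuned or how, so the following is only a hypothetical illustration of how such a search might be run in scikit-learn, reusing the assumed training split from the earlier sketches; the grid values are assumptions.

    from sklearn.model_selection import GridSearchCV

    # Hypothetical search grid -- the actual tuned values are not given
    # in the post.
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid={"n_estimators": [100, 300],
                    "max_depth": [None, 10, 20]},
        scoring="f1",
        cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_, round(search.best_score_, 3))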

Conclusions

Overall, model performance was at a high level, with ‘accuracy’ generally in the 80-90% range and recall at 80-85%. This proved hard to improve on, whether by tuning parameters in the random forest model or by selecting which features to include. On the other hand, this makes both models simple to set up and run. Kelleher et al. (2015) declare that naïve Bayes models are often used “to define a baseline accuracy score” because they are so easy to implement.

Predicting customer satisfaction was not the aim in itself. What the model might do is allow the airline to analyse which features are important contributors to satisfaction, which means studying ‘accuracy’ in terms of the positive outcomes – hence the choice of evaluation measures such as recall, F1 and AUC. This worked well with the naïve Bayes model, where feature selection gave a clear outcome, with some attributes contributing positively to the model and some negatively. For the random forest algorithm this was less successful, as all attributes made a mild positive contribution. In terms of interpretability a single decision tree might have been preferable, although that would have brought other problems, such as a tendency to overfit and probably lower accuracy. Binary logistic regression is another method that might have been useful, as its outputs explicitly show each attribute’s contribution; a sketch follows below. Another option would be to use several models together: Khan et al. (2024) review the literature on classification problems (albeit for class-imbalance problems) and conclude that ensemble methods generally show better performance.
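As a minimal sketch of that logistic regression alternative, assuming the same imputed and encoded data as in the earlier snippets, the signed coefficients would give an explicit per-attribute contribution:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    # Standardise so that coefficient magnitudes are roughly comparable
    # across attributes.
    scaler = StandardScaler().fit(X_train)
    logit = LogisticRegression(max_iter=1000)
    logit.fit(scaler.transform(X_train), y_train)

    # Positive coefficients push towards 'satisfied', negative away.
    coefs = pd.Series(logit.coef_[0], index=X.columns).sort_values()
    print(coefs.head(5))   # strongest negative contributors
    print(coefs.tail(5))   # strongest positive contributors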

References

Kelleher, J., Mac Namee, B., & D’Arcy, A. (2015). Fundamentals of machine learning for predictive data analytics. MIT Press.

Khan, A., Chaudhari, O., & Chandra, R. (2024). A review of ensemble learning and data augmentation models for class imbalanced problems: Combination, implementation and evaluation. Expert Systems with Applications. https://doi.org/10.1016/j.eswa.2023.122778
