Analyzing imbalanced online consumer review data in product design using geometric semantic genetic programming

https://doi.org/10.1016/j.engappai.2021.104442

Highlights

  • Customer satisfaction (CS) models are developed from online review data.

  • The data is imbalanced: unpopular products have fewer reviews; popular products have more.

  • Multi-objective genetic programming is used to tackle the imbalanced data.

  • Product designers select the most preferred model among all nondominated CS models.

  • Better prediction can be achieved by the proposed nondominated CS models.

Abstract

To develop a successful product, it is essential to understand the relationship between customer satisfaction (CS) and the design attributes of a new product. Nowadays, IoT technologies are used to collect online review data from social media, and more representative CS models can be developed from this data. However, online review data is imbalanced, since popular products receive more online consumer reviews and unpopular products receive fewer. When imbalanced data is used, CS models learn the characteristics of the majority data while rarely learning the minority data. The resulting analysis for product development can be misleading, since the CS model is biased towards popular products. This paper proposes an approach to generate nondominated CS models which learn equally from the imbalanced data of popular and unpopular products. A multi-objective optimization problem is formulated to learn equally from imbalanced data. This problem is solved by geometric semantic genetic programming (GSGP), which generates a Pareto set of nondominated CS models. Product designers select the most preferred models in the Pareto set. The preferred nondominated CS model attempts to trade off unpopular and popular products in order to determine optimal design attributes and maximize the CS. A case study of electric hair dryer design shows that the proposed GSGP is able to generate CS models with more accurate CS predictions than the commonly used methods. The proposed GSGP also generates a Pareto set of nondominated CS models which learn the consumer reviews of those dryers equally. Based on the Pareto set, the design team selects the most preferred CS model.

Introduction

The success rate of developing a popular product is only 10% or less because of the rapidly increasing competition in consumer markets. To increase the success rate in the marketplace, product designers need to determine the optimal design attributes of a potential product in order to maximize customer satisfaction (CS) (Chan et al., 2012). In the product design industries, house of quality matrices (HoQs) (Li et al., 2011) are generally developed to correlate CS with the design attributes. Based on the HoQs, products with optimal design attributes are more likely to be produced. However, an HoQ is developed based on the subjective experience of product designers and does not fully reflect the majority of consumer opinions; unpopular products are likely to be designed when inappropriate HoQs are used. To investigate wider consumer opinions, consumer surveys with questionnaires or interviews are conducted on potential consumers. Based on the collected survey data, an empirical CS model is developed in order to illustrate the relationship between the CS and the design attributes (Chan et al., 2019). The empirical model is used to determine the optimal design attributes to maximize CS. However, the design of consumer surveys is subjective and is not likely to cover wide consumer opinions and every product. In addition, customer surveys are time-consuming and do not reflect the most up-to-date consumer opinions.

Thanks to IoT technologies, more than 2.5 quintillion bytes of online customer review data are collected daily from product websites, consumer blogs, and social media such as amazon.com, Facebook, and Twitter. These online customer reviews are useful for identifying product preferences, customer needs and marketing strategies (Bello-Orgaz and Jason, 2016). Many consumers are influenced by online customer reviews before purchasing a product (Jin et al., 2016). A more representative CS model is developed when online customer reviews are used, since the reviews cover a wider consumer domain compared to survey data (Halavais, 2015). Online customer reviews have been used to develop empirical CS models for product design (Chan et al., 2019, Jiang et al., 2018, Jiang et al., 2019b).

More consumers use popular products, so more consumer opinions about them are expressed on product websites or consumer blogs. Therefore, more online customer reviews can be collected for popular products than for unpopular or rarely used products. Many marketing studies, such as printer design (Jin et al., 2014, Jin et al., 2015), washing machine design (Kim and Noh, 2019), and toy and game design (Wang et al., 2019a, Wang et al., 2019b), show that the more popular products received more online consumer reviews than the unpopular products. When a CS model is developed based on online consumer reviews, the CS model mostly learns the consumer buying behaviour on popular products, which form the majority of the data; the CS model rarely learns from unpopular products, which form the minority of the data. When product designers conduct analysis using the CS model, the analysis is possibly misleading since the data is imbalanced; the analysis is biased towards popular products. Although unpopular products are not welcomed by current consumers, it is possible that they will be popular in the future. The design attributes and the online consumer reviews of unpopular products need to be learned fairly to develop the CS model. Otherwise, some significant product characteristics are not learned.
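The bias described above can be illustrated with a small numerical sketch. All numbers are made up for illustration, and the names `reviews` and `mean_squared_error` are hypothetical, not taken from the paper. The sketch compares the error pooled over all reviews with the error computed separately for each product.

```python
def mean_squared_error(actual, predicted):
    """Mean of squared differences between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# product -> (actual CS scores, model predictions); all values are hypothetical.
reviews = {
    "popular": ([4.0] * 90, [4.1] * 90),    # 90 reviews, small prediction error
    "unpopular": ([2.0] * 10, [3.5] * 10),  # 10 reviews, large prediction error
}

# Error measured separately for each product.
per_product = {
    name: mean_squared_error(actual, predicted)
    for name, (actual, predicted) in reviews.items()
}

# Error measured over all reviews pooled together.
pooled_actual = [v for actual, _ in reviews.values() for v in actual]
pooled_predicted = [v for _, predicted in reviews.values() for v in predicted]
pooled = mean_squared_error(pooled_actual, pooled_predicted)

print(per_product, pooled)
```

With these numbers the pooled error is about 0.234, while the error on the unpopular product alone is 2.25. A model judged only by the pooled error looks acceptable even though it fits the minority product poorly.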

In this decade, modelling with imbalanced data has been identified as a challenging problem in data mining and machine learning (Tahir et al., 2019). Weighting methods are used in binary classification to classify samples into positive or negative classes (Maldonado and Lopez, 2014, Garcia-Pedrajas et al., 2013). In particular, the number of positive samples is often much smaller than that of negative samples, although positive classes are generally more interesting for classification (Lu et al., 2017). Generally, the weighting method assigns a larger weight to the sample class which has more impact on the classification accuracy. By doing so, the models perform effectively on the class of interest. However, this approach requires human judgement of the class impact, which is subjective. Wang et al. (2019a) and Gundel et al. (2019) proposed an automatic weighting method for positive and negative classes. More accurate predictions for both positive and negative classes are achieved in biomedical diagnosis. However, the weighting approaches are only developed for binary classification, which involves only two classes, positive and negative. An approach for parametrical modelling with more than two classes is still lacking. Therefore, it is necessary to develop a modelling approach which generates the relation between CS and the design attributes of products, since the data is collected from more than two products with multiple classes and the data is mostly imbalanced.
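As a rough illustration of class weighting in binary classification, the sketch below sets the weights by inverse class frequency. This choice is an assumption made for illustration only; it is one common heuristic and not necessarily the scheme used in the works cited above. The function names and the toy labels are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption for illustration: this is one common heuristic, not the
    weighting scheme of any specific cited method.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def weighted_error_rate(labels, predictions, weights):
    """Fraction of total class weight that falls on misclassified samples."""
    total = sum(weights[y] for y in labels)
    wrong = sum(weights[y] for y, p in zip(labels, predictions) if y != p)
    return wrong / total

# Hypothetical data: 95 negative samples and 5 positive samples.
labels = ["neg"] * 95 + ["pos"] * 5
# A trivial classifier that always predicts the majority class.
always_negative = ["neg"] * len(labels)

weights = inverse_frequency_weights(labels)
plain = sum(y != p for y, p in zip(labels, always_negative)) / len(labels)
weighted = weighted_error_rate(labels, always_negative, weights)

print(plain, weighted)
```

The unweighted error rate of the trivial classifier is 0.05, which hides the fact that every positive sample is misclassified. Under the weighted measure the error rate is about 0.5, so the minority class counts as much as the majority class.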

In this paper, we formulate the imbalanced data problem as a multi-objective optimization problem in which the data collected from numerous products is learned without bias. The CS model learns the data equally for popular products and unpopular products. This multi-objective problem can be solved by either the scalarization method or a Pareto method such as evolutionary computation (Gunantara, 2018, Giagkiozis and Fleming, 2015). The scalarization method transforms the multi-objective function into a single scalar function which weights and sums the learning errors for all products. The weights are determined by the judgements of product designers, which are subjective. If the weighting is inappropriate, misleading models are likely to be generated, which wrongly forecast popular products. In this paper, a Pareto method, evolutionary computation (Giagkiozis et al., 2015), is proposed. The method is able to automatically find a nondominated CS model which learns the data equally from all products. Here the genetic programming approach, namely geometric semantic genetic programming (GSGP) (Castelli et al., 2014), is proposed, since the regression based genetic programming approach is effective in modelling the nonlinear relationships between CS and the design attributes of products (Chan et al., 2011). The GSGP is faster than regression based genetic programming (Chan et al., 2011), which needs to generate the model structures and compute the model coefficients. GSGP guarantees that feasible models are generated, and the model accuracies are generally higher than those of the models generated by regression based genetic programming. The GSGP also overcomes the limitation of nonlinear modelling approaches, such as neural networks and deep learning, which generate only black-box models that are not used by product designers. The GSGP also overcomes the limitation of linear modelling approaches, such as statistical regression and fuzzy regression, which are not effective in modelling the nonlinear relationship between CS and design attributes. Instead of generating a single model as in the other approaches, the GSGP is able to generate a Pareto set of nondominated CS models which learn equally well on all products. These nondominated CS models are able to trade off some contradictory design issues in the products. When various design and marketing scenarios are considered, the product design team can select the most appropriate nondominated CS model in order to determine the preferred target value settings of the design attributes and maximize the CS.
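The difference between the scalarization method and a Pareto method can be sketched in a few lines. The sketch below is not the GSGP algorithm itself; it only shows how a weighted sum picks a single model, whereas a nondominance filter keeps every model that cannot be improved on one product without getting worse on another. The candidate models, their error values, and the function names are all hypothetical.

```python
def dominates(a, b):
    """True if error vector a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_set(candidates):
    """Names of the candidates whose error vectors are not dominated by any other."""
    return [
        name
        for name, errors in candidates.items()
        if not any(
            dominates(other_errors, errors)
            for other_name, other_errors in candidates.items()
            if other_name != name
        )
    ]

def weighted_sum(errors, weights):
    """Scalarized objective: weighted sum of the per-product errors."""
    return sum(w * e for w, e in zip(weights, errors))

# Hypothetical models: (error on popular product, error on unpopular product).
candidates = {
    "model_a": (0.10, 0.90),
    "model_b": (0.30, 0.40),
    "model_c": (0.60, 0.20),
    "model_d": (0.50, 0.60),  # worse than model_b on both products
}

front = pareto_set(candidates)

# Scalarization: the selected model depends entirely on the chosen weights.
favour_popular = min(candidates, key=lambda n: weighted_sum(candidates[n], (0.9, 0.1)))
favour_unpopular = min(candidates, key=lambda n: weighted_sum(candidates[n], (0.1, 0.9)))

print(front, favour_popular, favour_unpopular)
```

In this toy example the nondominance filter keeps `model_a`, `model_b` and `model_c`, while the weighted sum returns `model_a` or `model_c` depending on the weights. This mirrors the argument above: scalarization commits to one subjective weighting in advance, whereas a Pareto set leaves the final choice to the design team.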

The effectiveness of the proposed GSGP is validated by a case study of an electric hair dryer design which involves imbalanced online review data. We have identified numerous competitive dryers from the market, of which some are popular and others are unpopular. The popular dryers received more reviews; the unpopular ones received fewer. The results obtained by the proposed GSGP algorithm are compared to those obtained by the state-of-the-art methods for generating models which relate CS and the design attributes: linear regression (Lin and Wei, 2016), fuzzy regression (Jiang et al., 2019a), and genetic programming (regression-GP) (Chan et al., 2011, Chan et al., 2020). The results show that the proposed GSGP is able to generate more accurate models. The proposed GSGP is also able to generate a family of numerous models in the Pareto set which learn the electric hair dryers equally. This case study also demonstrates how the design team selects the most appropriate model based on their marketing knowledge and insight into each product.

The contribution of the paper is summarized as follows:

  • A multi-objective optimization problem is formulated to tackle the imbalanced data problem, so that the data collected from numerous products is learned without bias. The formulated problem is implemented in the GSGP algorithm in order to generate CS models which learn data equally for both popular and unpopular products. Numerical results show that CS models with better generalization capability can be generated by the GSGP algorithm.

  • The proposed GSGP algorithm is able to generate a set of nondominated CS models which attempt to trade off contradictory design attributes in the products. Product designers can select the most appropriate CS model from the model set. The case study demonstrates how the most appropriate CS model can be selected for new product development.

The rest of this paper is organized as follows. Section 2 discusses the methods currently used to solve imbalanced data problems and their limitations. Section 3 presents the proposed formulation and the proposed GSGP algorithm for solving imbalanced data problems. Section 4 presents the case study of an electric hair dryer design. A conclusion is drawn in Section 5.

Section snippets

Imbalanced online reviews for product opinions

In new product development, product designers need to understand the relationship between the dimensions of CS and the design attributes. The relationship is described by the CS model in (1), namely $F_s$:

$$y = F_s(\bar{x}) = F_s(x_1, x_2, \ldots, x_n), \tag{1}$$

where $y$ is a dimension of CS, and $\bar{x} = \{x_1, x_2, \ldots, x_n\}$ are the $n$ design attributes, of which $x_i$, with $i = 1, 2, \ldots, n$, is the $i$th design attribute. As an example, we consider the new product development of laptops. The dimensions of CS are the quality, performance, user-friendliness,

CS models for imbalanced online reviews

To overcome the bias of imbalanced online customer reviews in (2), the multi-objective problem in (5) is proposed. Solving (5) attempts to develop a nondominated CS model which is capable of learning the online consumer reviews equally for all the products. In (5), each $norm_l$ function measures the difference between the predictions of $F_s$ and the actual reviews of each product. The nondominated CS model is found when the value of one $norm_l$ function cannot be decreased without increasing the

A case study of imbalanced online reviews

We consider the design of electric hair dryers, which involves imbalanced online reviews. We identified ten competitive dryers from the market. The web scraping tool illustrated in Fig. 1 was used to collect the online reviews of the ten competitive dryers. Table 2 shows the numbers of online reviews collected for the ten competitive dryers. The table shows that each competitive dryer received a different number of online reviews. The most popular dryer, Dryer 1, received the

Conclusions and future studies

In this paper, we have formulated a multi-objective optimization problem which attempts to learn equally from imbalanced online review data collected from popular and unpopular products in the marketplace. We have also developed an evolutionary programming approach, namely GSGP, to solve this multi-objective optimization problem. GSGP is proposed since it overcomes the limitations of the commonly used approaches such as statistical and fuzzy regressions which are not effective for modelling nonlinear

CRediT authorship contribution statement

Kit Yan Chan: Conceptualization, Formal analysis, Validation, Writing – original draft, Writing – review & editing, Methodology, Investigation. C.K. Kwong: Data curation, Conceptualization, Writing – review & editing, Project administration. Huimin Jiang: Data curation, Conceptualization, Methodology, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The work described in this paper was supported by a grant from The Hong Kong Polytechnic University (Project No. G-UADL). This work was also partially supported by a grant from the National Natural Science Foundation of China (grant number 71901149).

References (41)

  • Wang, W.M. et al., 2019. Multiple affective attribute classification of online customer product reviews: A heuristic deep learning method for supporting kansei engineering. Eng. Appl. Artif. Intell.
  • Box, G.E.P., Hunter, J.S., Hunter, W.G., 2005. Statistics for Experiments: Design, Innovation, and Discovery, second...
  • Castelli, M. et al., 2015. A C++ framework for geometric semantic genetic programming. Genet. Program. Evol. Mach.
  • Castelli, M. et al., 2014. Semantic search-based genetic programming and the effect of intron deletion. IEEE Trans. Cybern.
  • Chan, K.Y., Kwong, C.K., Dillon, T.S., 2012. Computational Intelligence Techniques for New Product...
  • Chan, K.Y. et al., 2020. Predicting customer satisfaction based on online reviews and hybrid ensemble genetic programming algorithms. Eng. Appl. Artif. Intell.
  • Chan, K.Y. et al., 2011. Modelling customer satisfaction for product development using genetic programming. J. Eng. Des.
  • Chan, K.Y. et al., 2019. Affective design using machine learning: a survey and its prospect of conjoining big data. Int. J. Comput. Integr. Manuf.
  • Garcia-Pedrajas, N. et al., 2013. OligoIs: Scalable instance selection for class-imbalanced data sets. IEEE Trans. Cybern.
  • Giagkiozis, I. et al., 2015. Methods for multi-objective optimization: an analysis. Inform. Sci.