Fabrizio Carinci, Monash Institute of Health Services Research, Monash University, Clayton, Australia
Antonio Ciampi, Department of Epidemiology, McGill University, Montreal, Canada
2002
Recently, several contributions from the field of machine learning1-5 have proposed
innovative ways of extracting information from data stored across distributed
databases, including methods to synthesise, or meta-learn, prediction
models from partial structures computed at local nodes.
The implementation of such sophisticated approaches is extremely appealing,
since it would allow the use of modern web technology to empower a multitude of
remote users spread across geographically distributed networks.
So far, results have been extremely encouraging. Simulations1-4 have shown that the
accuracy of meta-learning models may not only match that achieved by traditional
techniques based on centralised storage, but may also improve on it significantly.
The implementation of distributed technology may represent a major social
innovation too.
Decentralised responsibility would not only minimise problems of storage and
data transmission across nodes, but also enhance the role and independence of
collaborative groups and local communities. This strategy would also foster
better quality, higher speed and extended coverage of the information
currently available.
Such a scenario would be most beneficial for health systems, a fundamental sector of
human activity where knowledge discovery and exchange are constantly hampered
by practical issues related to costs, privacy and ownership of individual data.
Unfortunately, most theoretical developments in machine learning and artificial intelligence
have focused on improved accuracy in classification5, rather than
generalised regression modelling, a technique of critical relevance in most
health applications.
The RECPAM approach6 offers a unique solution that covers both aspects in a
versatile way, and it is ideally suited to epidemiological and biostatistical
problems. The method is not only powerful but also very general, since it is
based on generalised regression modelling and uses the likelihood ratio as a
measure of information gain. However, it currently lacks meta-learning
functionality.
Objectives
The objective of the REIGN project is to merge the latest technology from artificial
intelligence and modern statistics to deploy a comprehensive approach using
generalised regression trees as a framework for meta-learning from distributed
databases.
Methods
An opportunity to extend the features of RECPAM is offered by recent innovations
in statistical techniques involving data manufacturing7, stacking8, and the
combination of estimates9,10.
We have identified the following methodology to test the performance
of distributed RECPAM trees.
Imagine that a specific problem in a particular population, say a large change in
glycated haemoglobin in diabetic patients, can be measured using standardised
criteria, and that other baseline characteristics are simultaneously recorded.
Suppose also that a standardised minimum dataset is available on a large scale,
say at the national level.
In this case, longitudinal data analysis on the complete dataset taking into account inter-practice variability of estimates
and intra-practice correlation would be the most appropriate design. However,
corporate, legal and social implications may not allow the transmission of medical
records from the point of care to a central agency. This is a major
impediment to the application of traditional methods.
The approximation methods proposed here would serve the purpose of model building
at the national level, leaving data physically stored in decentralised locations
and not requiring the transmission of medical records.
For simplicity, let us suppose that patients do not
travel across centres (sites) and that no practice component is present.
Consequently, logistic regression trees could be used
as a predictive tool to stratify the population into groups (RECPAM classes)
associated with poorer outcomes in terms of abnormal changes in glycated
haemoglobin, adjusting for population-local and population-global confounders.
The problem of extracting a global RECPAM model may be approached through four steps:
1) local modelling (site learning); 2) combination via stacking (site meta-learning);
3) data manufacturing (born again RECPAM); 4) parallel RECPAM (network learning).
Step 1. Site learning
Suppose that there are K distributed data sets.
This step simply consists of applying the RECPAM approach to grow a large tree,
based on the covariates and indicator variables included in the design matrix z,
at each of the k=1,..,K sites (of size Nk). The site-specific RECPAM model,
denoted as Rk(z), is then transferred to a meta-engine, which coordinates the
distribution of information aggregated by site and shared across the system.
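
As a rough illustration of Step 1 (not the actual RECPAM/SAS implementation11,12), the sketch below uses a generic decision-tree learner from scikit-learn as a stand-in for the site-level large tree Rk(z), and serialisation as a stand-in for transfer to the meta-engine; the function names and the stopping rule are hypothetical.

# Illustrative stand-in only: RECPAM itself is implemented as SAS macros11,12.
# A generic scikit-learn tree plays the role of the site-level large tree Rk(z),
# and pickling plays the role of shipping the model (not the data) to the meta-engine.
import pickle
from sklearn.tree import DecisionTreeClassifier

def site_learning(z_k, y_k):
    """Grow a large tree on the local design matrix z_k and outcome y_k at site k."""
    tree = DecisionTreeClassifier(min_samples_leaf=10)  # hypothetical growth limit
    return tree.fit(z_k, y_k)

def transfer_to_meta_engine(tree, site_id):
    """Serialise the fitted site model; only the model, never the records, leaves the site."""
    return site_id, pickle.dumps(tree)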
Step 2. Site meta-learning
The meta-engine computes all site-specific meta-models using a stacked regressions
approach8. A total of K meta-models are estimated, the k-th combining the K-1
RECPAM models from all sites except site k, k=1,..,K.
This approach to combining estimates is based on results
showing particularly good performance compared with alternatives
such as arcing or bagging5.
The problem basically consists of finding a set of non-negative weights
for the K-1 RECPAM large-tree models by least squares.
The site meta-model then becomes the linear combination
Rk*(z) = sum of wj Rj(z) over the K-1 sites j different from k, with wj >= 0.
The justification for using non-negative constraints
on the linear parameters is based on results from Breiman8, who found
that the resulting model showed consistently better prediction error than
unbounded estimates5. Ting has also observed that, although bounding
the coefficients does not improve accuracy over simple least squares, it
simplifies interpretation in terms of exploring the impact of single models
on the prediction outputs5. We believe this might be an important
feature that would allow direct use of meta-learning at the site level, prior
to and independently of the estimation of the RECPAM model at the network
level.
Therefore, although a site meta-model remains something of a black box,
without a single tree structure, it can be used for two important purposes:
a) an accurate algorithm available locally; b) the implementation of Step 3.
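
A minimal sketch of the stacking step, assuming the site models behave like the fitted trees of the previous sketch and that the outcome is binary; the non-negative weights are obtained by non-negative least squares, in the spirit of stacked regressions8. Function and variable names are invented for illustration.

# Sketch of Step 2: combine the K-1 foreign site models for site k by regressing the
# local outcome on their predictions under non-negativity constraints.
import numpy as np
from scipy.optimize import nnls

def site_meta_learning(z_k, y_k, foreign_models):
    """Build the meta-model for site k from the K-1 models of the other sites."""
    # Columns = predicted probabilities of each foreign model on the local data.
    P = np.column_stack([m.predict_proba(z_k)[:, 1] for m in foreign_models])
    weights, _ = nnls(P, np.asarray(y_k, dtype=float))  # non-negative least squares

    def meta_model(z_new):
        Q = np.column_stack([m.predict_proba(z_new)[:, 1] for m in foreign_models])
        return Q @ weights                               # linear combination of site predictions

    return weights, meta_model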
Step 3. Born again RECPAM
Consistently with the original technique defined as "born again trees"7, the
dataset at each site k=1,..,K is manufactured as follows.
A threshold palt=0.5 is selected. The data matrix includes the outcome Yi and the
covariates Xij, j=1,..,M, for each observation. A single record i is selected at
random. Then, for j=1,..,M, a number r in [0,1] is drawn at random. If r is
greater than palt, the manufactured value is defined as xmij = xij; otherwise it
is defined by selecting at random from {xij, i=1,..,Nk} (palt smearing).
A sample size NS is selected, and data are manufactured until NS observations
land in the current RECPAM node candidate for splitting. The process is repeated
across all sites. A manufactured outcome is defined by submitting the NS
observations to the combined site meta-model obtained in Step 2.
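
The manufacturing scheme can be sketched as below, with the simplification that NS observations are generated outright rather than being filtered to the current candidate node; the array conventions and names are assumptions, and meta_model stands for the site meta-model from Step 2.

# Sketch of the Step 3 smearing scheme at a single site.
import numpy as np

def manufacture(X_k, meta_model, n_s, p_alt=0.5, seed=None):
    """Generate n_s manufactured observations and label them with the Step 2 meta-model."""
    rng = np.random.default_rng(seed)
    n, m = X_k.shape
    rows = []
    for _ in range(n_s):
        i = rng.integers(n)                          # pick a seed record at random
        x_new = X_k[i].copy()
        for j in range(m):
            if rng.random() <= p_alt:                # with probability p_alt, smear:
                x_new[j] = X_k[rng.integers(n), j]   # replace with a random site value
        rows.append(x_new)
    X_m = np.asarray(rows)
    y_m = meta_model(X_m)                            # manufactured outcome from the meta-model
    return X_m, y_m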
Step 4. Network meta-learning
At each site, the sequence of likelihood ratios for all tree covariates is
recorded. The split-defining question corresponding to the highest average
likelihood ratio is used to split the current node. After the split, if either
child node contains no instance of the training set at any of the sites, the
parent is declared a terminal node and left unsplit. The process is repeated
until the large tree is constructed. The same manufacturing and split criteria
apply for pruning and amalgamation. Cross-validation is applied by repeating the
process ten times on approximately 9/10 of the sites, following the original
RECPAM procedure for a centralised dataset10.
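
A schematic sketch of the network split rule, under the assumption that each site reports a likelihood-ratio value for every candidate split-defining question together with the training counts each child node would receive; reading the stopping rule as "empty at every site" is an interpretation, and all names are hypothetical.

# Sketch of the Step 4 split selection and stopping check.
import numpy as np

def choose_network_split(site_lr):
    """site_lr: dict mapping each candidate split-defining question to the list of
    likelihood ratios reported by the K sites; pick the highest average."""
    return max(site_lr, key=lambda split: np.mean(site_lr[split]))

def can_split(site_child_counts):
    """site_child_counts: list over sites of (n_left, n_right) training counts after
    the chosen split; the parent is left unsplit if either child is empty at every site."""
    left_empty = all(n_left == 0 for n_left, _ in site_child_counts)
    right_empty = all(n_right == 0 for _, n_right in site_child_counts)
    return not (left_empty or right_empty)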
Performance evaluation
The performance of the new algorithms will be evaluated by simulation over shared
datasets. Relevant data will be available from health services research, taken as
the target area of our investigation.
The selected performance indicators will be the deviance, AIC, BIC, mean square
error, generalised R-square and, for logistic regression, the c-index (area under
the ROC curve) and the error rate.
The indicators for REIGN will be reported against values obtained through
traditional application of the RECPAM model on the entire training sample,
equivalent to the union of the disjoint partitions, as well as against results
from competing models such as logistic regression with backward elimination.
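
For the logistic case, two of the listed indicators can be computed as follows; the helper names and the 0.5 classification threshold are assumptions made only for illustration.

# y_true are observed 0/1 outcomes, p_hat are predicted probabilities from any model.
import numpy as np
from sklearn.metrics import roc_auc_score

def c_index(y_true, p_hat):
    """c-index = area under the ROC curve."""
    return roc_auc_score(y_true, p_hat)

def error_rate(y_true, p_hat, threshold=0.5):
    """Proportion misclassified at the given probability threshold."""
    return np.mean((p_hat >= threshold).astype(int) != np.asarray(y_true))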
Computation time
Suppose the computation time required for a traditional RECPAM run on the complete
database is T. Assume also that, for the new algorithm described above on K disjoint
partitions, the time required for each step is the following: C for site trees (Step 1),
L for linear combinations (Step 2), M for RECPAM manufacturing at the site level (Step 3),
and G for network meta-learning (Step 4), the latter to be multiplied by a J-fold
cross-validation factor.
The total time for meta-learning will then be equal to K[C+L+(M+G)(J+1)].
Therefore, the additional computation time required by the REIGN algorithm relative to
traditional RECPAM will be equal to
D = (T-KC)(J+1) - K[L+(M+G)(J+1)].
If we assume that, for large databases of average size, the nonlinear price to
pay for execution on centralised datasets would be at least equivalent to the
process of manufacturing for the whole set of disjoint partitions, then the 'price' of
REIGN would be equal to
D = K[L+G(J+1)].
Given that the computation will normally occur over many different computers,
the total running time should not significantly exceed that of single-site RECPAM.
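
As a purely numerical illustration of these formulas, with arbitrary hypothetical timings, the total cost and a rough wall-clock estimate under parallel execution across the K sites might be computed as follows; the parallel estimate rests on the simplifying assumption that site-level steps run concurrently while the meta-engine's K stacking fits run sequentially.

# Arbitrary hypothetical timings (no benchmarks behind these numbers).
K, J = 10, 9                      # number of sites, cross-validation folds
C, L, M, G = 5.0, 0.5, 2.0, 1.0   # per-step times as defined in the text

total = K * (C + L + (M + G) * (J + 1))   # K[C+L+(M+G)(J+1)], summed over all sites
# Rough wall-clock time if the K sites work concurrently and only the
# meta-engine's K stacking fits are sequential (a simplifying assumption).
wall_clock = C + K * L + (M + G) * (J + 1)

print(f"total work: {total}, approximate wall-clock: {wall_clock}")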
Additional features
The REIGN project will also design, implement and test the following:
1) cross-validation by parallel processing: a large tree will be pruned using the
procedure adopted by Ciampi et al.10 over multiple processors/computers;
2) generalised marginal tree models for repeated measurements: the GEE model will
be used to take into account possible bias generated by intra-cluster correlated
outcomes (see the sketch after this list);
3) generalised mixed regression trees: a random effect component will be added,
according to the theory of mixed models, to explicitly model the variance of
parameters when clustering of effects is present;
4) global algorithm: recursive partitioning will be attempted using a global
algorithm, i.e. computing likelihood ratios at each step over the entire training
sample, rather than performing generalised linear modelling or proportional
hazards regression locally at the current node;
5) multinomial and negative binomial models: these models will be added to the
range of options currently available.
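
A hypothetical illustration of item 2, sketching how a GEE marginal model could account for intra-practice correlation of a binary outcome; the dataset, variable names ('y', 'x', 'practice') and the working correlation structure are all invented for the example.

# GEE marginal model for intra-practice correlation, on a small synthetic dataset.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "practice": np.repeat(np.arange(20), 30),   # 20 practices, 30 patients each
    "x": rng.normal(size=600),                  # a baseline covariate
})
df["y"] = (rng.random(600) < 1 / (1 + np.exp(-df["x"]))).astype(int)  # binary outcome

model = smf.gee("y ~ x", groups="practice", data=df,
                family=sm.families.Binomial(),
                cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())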
Software
The REIGN project will build upon the RECPAM/SAS software11,12 developed to
implement the traditional RECPAM approach. The software, in the form of SAS
macros, will be redesigned and adapted for the new procedures and related
applications.
The REIGN project received seed funding in 2002 from VPAC
(the Victorian Partnership for Advanced Computing).
Data analysis and statistical innovation will contribute to the preparation of
articles, which will partly be included in the PhD thesis of F. Carinci, enrolled
as a Monash staff member without a supervisor.
Completion of the degree is expected by the end of 2008.
References
1. Tsoumakas G, Vlahavas I. Effective Stacking of Distributed Classifiers. In: van Harmelen F (ed.), Proceedings of the 15th European Conference on Artificial Intelligence, IOS Press, Amsterdam, 2002. http://users.auth.gr/~greg/Publications/ecai2002.pdf
2. Prodromidis A, Chan P, Stolfo S. Meta-learning in distributed data mining systems: Issues and approaches. In: Advances in Distributed and Parallel Knowledge Discovery, MIT Press, 2000. http://www.cs.fit.edu/~pkc/papers/ddmbook.ps
3. Chan P, Stolfo S. On the Accuracy of Meta-Learning for Scalable Data Mining. JIIS 1997; 8(1): 5-28. http://www.cs.columbia.edu/~pkc/papers/jiis97.ps
4. Chan P, Stolfo S. Sharing Learned Models among Remote Database Partitions by Local Meta-learning. Proc. Second Intl. Conf. on Knowledge Discovery & Data Mining, 1996: 2-7. http://www.cs.columbia.edu/~pkc/papers/kdd96.ps
5. Ting KM, Witten IH. Issues in Stacked Generalization. Journal of Artificial Intelligence Research 1999; 10: 271-289. http://www.cs.waikato.ac.nz/~ml/publications/1999/99KMT-IHW-Issues.pdf
6. Ciampi A. Constructing Prediction Trees from Data: the RECPAM approach. In: Computational Aspects of Model Choice, 105-52, Physica-Verlag, Heidelberg, 1992.
7. Breiman L, Shang N. Born again trees. ftp://ftp.stat.berkeley.edu/pub/users/breiman/BAtrees.ps, 1996.
8. Breiman L. Stacked regressions. Machine Learning 1996; 24: 41-8.
9. LeBlanc M, Tibshirani R. Combining estimates in regression and classification. JASA 1996; 91(436): 1641-50.
10. Ciampi A, Negassa A, Lou Z. Tree-structured prediction for censored survival data and the Cox model. Journal of Clinical Epidemiology 1995; 48(5): 675-689.
11. Carinci F, Pellegrini F. RECPAM/SAS, a statistical tool for criterion-driven data mining. Monash Institute of Health Services Research Technical Report. http://www.fabcarinci.net/reports/Recpam_booklet.pdf
12. Carinci F, Nicolucci A, Pellegrini F. Regression trees in health services and outcomes research: an application of the RECPAM approach using quality of care as a criterion. Monash Institute of Health Services Research Technical Report. http://www.fabcarinci.net/reports/Recpam_booklet.pdf