Fabrizio Carinci, Monash Institute of Health Services Research, Monash University, Clayton, Australia
Antonio Ciampi, Department of Epidemiology, McGill University, Montreal, Canada
2002
Recently, several contributions from the field of machine learning1-5 have proposed
innovative ways of extracting information from data stored across distributed
databases, including methods to synthesise, or meta-learn, prediction
models from partial structures computed at local nodes.
The implementation of such sophisticated approaches is extremely appealing,
since it would allow the use of modern web technology to empower a multitude of
remote users spread across geographically distributed networks.
So far, results have been extremely encouraging. Simulations1-4 have shown that the
accuracy of meta-learning models may not only match that achieved by traditional
techniques based on centralised storage, but may also improve on it significantly.
The implementation of distributed technology may represent a major social
innovation too.
Decentralised responsibility would not only minimise problems of storage and
data transmission across nodes, but also enhance the role and independence of
collaborative groups and local communities. This strategy would also foster
better quality, higher speed and extended coverage of the information
currently available.
Such a scenario would be most beneficial for health systems, a fundamental sector of
human activity where knowledge discovery and exchange are constantly hampered
by practical issues related to costs, privacy and ownership of individual data.
Unfortunately, most theoretical developments in machine learning and artificial intelligence
have focused on improved accuracy in classification5, rather than
generalised regression modelling, a technique of critical relevance in most
health applications.
The RECPAM approach6 offers a unique solution that covers both aspects in a
versatile way, and it is ideally suited to epidemiological and biostatistical
problems. The method is not only powerful but also very general, since it is
based on generalised regression modelling and uses the likelihood ratio as a
measure of information gain. However, it currently lacks meta-learning
functionality.
Objectives
The objective of the REIGN project is to merge the latest technology from artificial
intelligence and modern statistics to deploy a comprehensive approach using
generalised regression trees as a framework for meta-learning from distributed
databases.
Methods
An opportunity to extend the features of RECPAM is offered by recent innovations
in statistical techniques involving data manufacturing7, stacking8, and the
combination of estimates9,10.
We have identified the following methodology to test the performance
of distributed RECPAM trees.
Imagine that a specific problem in a particular population, say a large change in
glycated haemoglobin in diabetic patients, can be measured using standardised
criteria, and that other baseline characteristics are simultaneously recorded.
Suppose also that a standardised minimum dataset is available on a large scale,
say at the national level.
In this case, longitudinal data analysis on the complete dataset taking into account inter-practice variability of estimates
and intra-practice correlation would be the most appropriate design. However,
corporate, legal and social implications may not allow the transmission of medical
records from the point of care to a central agency. This is a major
impediment to the application of traditional methods.
The approximation methods proposed here would serve the purpose of model building
at the national level, leaving data physically stored in decentralised locations
and not requiring the transmission of medical records.
For simplicity, let us suppose that patients do not
travel across centres (sites) and that no practice component is present.
Consequently, logistic regression trees could be used
as a predictive tool to stratify the population into groups (RECPAM classes)
associated with poorer outcomes in terms of abnormal changes in glycated
haemoglobin, adjusting for population-local and population-global confounders.
The problem of extracting a global RECPAM model may be approached through four steps:
1) local modelling (site learning); 2) combination via stacking (site meta-learning);
3) data manufacturing (born again RECPAM); 4) parallel RECPAM (network learning).
Step 1. Site learning
Suppose that there are K distributed data sets.
This step simply consists of applying the RECPAM approach to grow a large tree,
based on the covariates and indicator variables included in the design matrix z,
at each of the k=1,..,K sites (of size Nk). The site-specific RECPAM model,
denoted as Rk(z), is then transferred to a meta-engine, which coordinates the
distribution of information aggregated by site and shared across the system.
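
As a rough illustration of Step 1 (not the actual RECPAM/SAS implementation11,12), the sketch below uses a generic decision-tree learner from scikit-learn as a stand-in for the site-level large tree Rk(z), and serialisation as a stand-in for transfer to the meta-engine; the function names and the stopping rule are hypothetical.

# Illustrative stand-in only: RECPAM itself is implemented as SAS macros11,12.
# A generic scikit-learn tree plays the role of the site-level large tree Rk(z),
# and pickling plays the role of shipping the model (not the data) to the meta-engine.
import pickle
from sklearn.tree import DecisionTreeClassifier

def site_learning(z_k, y_k):
    """Grow a large tree on the local design matrix z_k and outcome y_k at site k."""
    tree = DecisionTreeClassifier(min_samples_leaf=10)  # hypothetical growth limit
    return tree.fit(z_k, y_k)

def transfer_to_meta_engine(tree, site_id):
    """Serialise the fitted site model; only the model, never the records, leaves the site."""
    return site_id, pickle.dumps(tree)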
Step 2. Site meta-learning
The meta-engine computes all site-specific meta-models using a stacked regressions
approach8. A total of K meta-models are estimated, the k-th combining the K-1
RECPAM models from all sites except site k, k=1,..,K.
This approach to combining estimates is based on results
showing particularly good performance compared with alternatives
such as arcing or bagging5.
The problem basically consists of finding a set of non-negative weights
for the K-1 RECPAM large-tree models by least squares.
The site meta-model then becomes the linear combination
Rk*(z) = sum of wj Rj(z) over the K-1 sites j different from k, with wj >= 0.
The justification for using non-negative constraints
on the linear parameters is based on results from Breiman8, who found
that the resulting model showed consistently better prediction error than
unbounded estimates5. Ting has also observed that, although bounding
the coefficients does not improve accuracy over simple least squares, it
simplifies interpretation in terms of exploring the impact of single models
on the prediction outputs5. We believe this might be an important
feature that would allow direct use of meta-learning at the site level, prior
to and independently of the estimation of the RECPAM model at the network
level.
Therefore, although a site meta-model remains something of a black box,
without a single tree structure, it can be used for two important purposes:
a) an accurate algorithm available locally; b) the implementation of Step 3.
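
A minimal sketch of the stacking step, assuming the site models behave like the fitted trees of the previous sketch and that the outcome is binary; the non-negative weights are obtained by non-negative least squares, in the spirit of stacked regressions8. Function and variable names are invented for illustration.

# Sketch of Step 2: combine the K-1 foreign site models for site k by regressing the
# local outcome on their predictions under non-negativity constraints.
import numpy as np
from scipy.optimize import nnls

def site_meta_learning(z_k, y_k, foreign_models):
    """Build the meta-model for site k from the K-1 models of the other sites."""
    # Columns = predicted probabilities of each foreign model on the local data.
    P = np.column_stack([m.predict_proba(z_k)[:, 1] for m in foreign_models])
    weights, _ = nnls(P, np.asarray(y_k, dtype=float))  # non-negative least squares

    def meta_model(z_new):
        Q = np.column_stack([m.predict_proba(z_new)[:, 1] for m in foreign_models])
        return Q @ weights                               # linear combination of site predictions

    return weights, meta_model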
Step 3. Born again RECPAM
Consistently with the original technique defined as "born again trees"7, the
dataset at each site k=1,..,K is manufactured as follows.
A threshold palt=0.5 is selected. The data matrix includes the outcome Yi and the
covariates Xij, j=1,..,M, for each observation. A single record i is selected at
random. Then, for j=1,..,M, a number r in [0,1] is drawn at random. If r is
greater than palt, the manufactured value is defined as xmij = xij; otherwise it
is defined by selecting at random from {xij, i=1,..,Nk} (palt smearing).
A sample size NS is selected, and data are manufactured until NS observations
land in the current RECPAM node candidate for splitting. The process is repeated
across all sites. A manufactured outcome is defined by submitting the NS
observations to the combined site meta-model obtained in Step 2.
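
The manufacturing scheme can be sketched as below, with the simplification that NS observations are generated outright rather than being filtered to the current candidate node; the array conventions and names are assumptions, and meta_model stands for the site meta-model from Step 2.

# Sketch of the Step 3 smearing scheme at a single site.
import numpy as np

def manufacture(X_k, meta_model, n_s, p_alt=0.5, seed=None):
    """Generate n_s manufactured observations and label them with the Step 2 meta-model."""
    rng = np.random.default_rng(seed)
    n, m = X_k.shape
    rows = []
    for _ in range(n_s):
        i = rng.integers(n)                          # pick a seed record at random
        x_new = X_k[i].copy()
        for j in range(m):
            if rng.random() <= p_alt:                # with probability p_alt, smear:
                x_new[j] = X_k[rng.integers(n), j]   # replace with a random site value
        rows.append(x_new)
    X_m = np.asarray(rows)
    y_m = meta_model(X_m)                            # manufactured outcome from the meta-model
    return X_m, y_m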
Step 4. Network meta-learning
At each site, the sequence of likelihood ratios for all tree covariates is
recorded. The split-defining question corresponding to the highest average
likelihood ratio is used to split the current node. After the split, if either
child node contains no instance of the training set at any of the sites, the
parent is declared a terminal node and left unsplit. The process is repeated
until the large tree is constructed. The same manufacturing and split criteria
apply for pruning and amalgamation. Cross-validation is applied by repeating the
process ten times on approximately 9/10 of the sites, following the original
RECPAM procedure for a centralised dataset10.
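
A schematic sketch of the network split rule, under the assumption that each site reports a likelihood-ratio value for every candidate split-defining question together with the training counts each child node would receive; reading the stopping rule as "empty at every site" is an interpretation, and all names are hypothetical.

# Sketch of the Step 4 split selection and stopping check.
import numpy as np

def choose_network_split(site_lr):
    """site_lr: dict mapping each candidate split-defining question to the list of
    likelihood ratios reported by the K sites; pick the highest average."""
    return max(site_lr, key=lambda split: np.mean(site_lr[split]))

def can_split(site_child_counts):
    """site_child_counts: list over sites of (n_left, n_right) training counts after
    the chosen split; the parent is left unsplit if either child is empty at every site."""
    left_empty = all(n_left == 0 for n_left, _ in site_child_counts)
    right_empty = all(n_right == 0 for _, n_right in site_child_counts)
    return not (left_empty or right_empty)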
Performance evaluation
The performance of the new algorithms will be evaluated by simulation over shared
datasets. Relevant data will be available from health services research, taken as
the target area of our investigation.
The selected performance indicators will be the deviance, AIC, BIC, mean square
error, generalised R-square and, for logistic regression, the c-index (area under
the ROC curve) and the error rate.
The indicators for REIGN will be reported against values obtained through
traditional application of the RECPAM model on the entire training sample,
equivalent to the union of the disjoint partitions, as well as against results
from competing models such as logistic regression with backward elimination.
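
For the logistic case, two of the listed indicators can be computed as follows; the helper names and the 0.5 classification threshold are assumptions made only for illustration.

# y_true are observed 0/1 outcomes, p_hat are predicted probabilities from any model.
import numpy as np
from sklearn.metrics import roc_auc_score

def c_index(y_true, p_hat):
    """c-index = area under the ROC curve."""
    return roc_auc_score(y_true, p_hat)

def error_rate(y_true, p_hat, threshold=0.5):
    """Proportion misclassified at the given probability threshold."""
    return np.mean((p_hat >= threshold).astype(int) != np.asarray(y_true))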
Computation time
Suppose the computation time required for a traditional RECPAM run on the complete
database is T. Assume also that, for the new algorithm described above on K disjoint
partitions, the time required for each step is the following: C for site trees (Step 1),
L for linear combinations (Step 2), M for RECPAM manufacturing at the site level (Step 3),
and G for network meta-learning (Step 4), the latter to be multiplied by a J-fold
cross-validation factor.
The total time for meta-learning will then be equal to K[C+L+(M+G)(J+1)].
Therefore, the additional computation time required by the REIGN algorithm relative to
traditional RECPAM will be equal to
D = (T-KC)(J+1) - K[L+(M+G)(J+1)].
If we assume that, for large databases of average size, the nonlinear price to
pay for execution on centralised datasets would be at least equivalent to the
process of manufacturing for the whole set of disjoint partitions, then the 'price' of
REIGN would be equal to
D = K[L+G(J+1)].
Given that the computation will normally occur over many different computers,
the total running time should not significantly exceed that of single-site RECPAM.
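
As a purely numerical illustration of these formulas, with arbitrary hypothetical timings, the total cost and a rough wall-clock estimate under parallel execution across the K sites might be computed as follows; the parallel estimate rests on the simplifying assumption that site-level steps run concurrently while the meta-engine's K stacking fits run sequentially.

# Arbitrary hypothetical timings (no benchmarks behind these numbers).
K, J = 10, 9                      # number of sites, cross-validation folds
C, L, M, G = 5.0, 0.5, 2.0, 1.0   # per-step times as defined in the text

total = K * (C + L + (M + G) * (J + 1))   # K[C+L+(M+G)(J+1)], summed over all sites
# Rough wall-clock time if the K sites work concurrently and only the
# meta-engine's K stacking fits are sequential (a simplifying assumption).
wall_clock = C + K * L + (M + G) * (J + 1)

print(f"total work: {total}, approximate wall-clock: {wall_clock}")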
Additional features
The REIGN project will also design, implement and test the following:
1) cross-validation by parallel processing: a large tree will be pruned using the
procedure adopted by Ciampi et al.10 over multiple processors/computers;
2) generalised marginal tree models for repeated measurements: the GEE model will
be used to take into account possible bias generated by intra-cluster correlated
outcomes (see the sketch after this list);
3) generalised mixed regression trees: a random effect component will be added,
according to the theory of mixed models, to explicitly model the variance of
parameters when clustering of effects is present;
4) global algorithm: recursive partitioning will be attempted using a global
algorithm, i.e. computing likelihood ratios at each step over the entire training
sample, rather than performing generalised linear modelling or proportional
hazards regression locally at the current node;
5) multinomial and negative binomial models: these models will be added to the
range of options currently available.
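
A hypothetical illustration of item 2, sketching how a GEE marginal model could account for intra-practice correlation of a binary outcome; the dataset, variable names ('y', 'x', 'practice') and the working correlation structure are all invented for the example.

# GEE marginal model for intra-practice correlation, on a small synthetic dataset.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "practice": np.repeat(np.arange(20), 30),   # 20 practices, 30 patients each
    "x": rng.normal(size=600),                  # a baseline covariate
})
df["y"] = (rng.random(600) < 1 / (1 + np.exp(-df["x"]))).astype(int)  # binary outcome

model = smf.gee("y ~ x", groups="practice", data=df,
                family=sm.families.Binomial(),
                cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())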
Software
The REIGN project will build upon the RECPAM/SAS software11,12 developed to
implement the traditional RECPAM approach. The software, in the form of SAS
macros, will be redesigned and adapted for the new procedures and related
applications.
The REIGN project received seed funding in 2002 from VPAC
(the Victorian Partnership for Advanced Computing).
Data analysis and statistical innovation will contribute to the preparation of
articles, which will partly be included in the PhD thesis of F. Carinci, enrolled
as a Monash staff member without a supervisor.
Completion of the degree is expected by the end of 2008.
References
1. Tsoumakas G, Vlahavas I. Effective Stacking of Distributed Classifiers. In: van Harmelen F (ed.), Proceedings of the 15th European Conference on Artificial Intelligence, IOS Press, Amsterdam, 2002. http://users.auth.gr/~greg/Publications/ecai2002.pdf
2. Prodromidis A, Chan P, Stolfo S. Meta-learning in distributed data mining systems: Issues and approaches. In: Advances in Distributed and Parallel Knowledge Discovery, MIT Press, 2000. http://www.cs.fit.edu/~pkc/papers/ddmbook.ps
3. Chan P, Stolfo S. On the Accuracy of Meta-Learning for Scalable Data Mining. JIIS 1997; 8(1): 5-28. http://www.cs.columbia.edu/~pkc/papers/jiis97.ps
4. Chan P, Stolfo S. Sharing Learned Models among Remote Database Partitions by Local Meta-learning. Proc. Second Intl. Conf. on Knowledge Discovery & Data Mining, 1996: 2-7. http://www.cs.columbia.edu/~pkc/papers/kdd96.ps
5. Ting KM, Witten IH. Issues in Stacked Generalization. Journal of Artificial Intelligence Research 1999; 10: 271-289. http://www.cs.waikato.ac.nz/~ml/publications/1999/99KMT-IHW-Issues.pdf
6. Ciampi A. Constructing Prediction Trees from Data: the RECPAM approach. In: Computational Aspects of Model Choice, 105-52, Physica-Verlag, Heidelberg, 1992.
7. Breiman L, Shang N. Born again trees. ftp://ftp.stat.berkeley.edu/pub/users/breiman/BAtrees.ps, 1996.
8. Breiman L. Stacked regressions. Machine Learning 1996; 24: 41-8.
9. LeBlanc M, Tibshirani R. Combining estimates in regression and classification. JASA 1996; 91(436): 1641-50.
10. Ciampi A, Negassa A, Lou Z. Tree-structured prediction for censored survival data and the Cox model. Journal of Clinical Epidemiology 1995; 48(5): 675-689.
11. Carinci F, Pellegrini F. RECPAM/SAS, a statistical tool for criterion-driven data mining. Monash Institute of Health Services Research Technical Report. http://www.fabcarinci.net/reports/Recpam_booklet.pdf
12. Carinci F, Nicolucci A, Pellegrini F. Regression trees in health services and outcomes research: an application of the RECPAM approach using quality of care as a criterion. Monash Institute of Health Services Research Technical Report. http://www.fabcarinci.net/reports/Recpam_booklet.pdf