Using Latent Semantic Indexing for Literature Based Discovery
Michael D. Gordon
Computer and Information Systems, School of Business, University of Michigan, Ann Arbor, MI 48109-1234.
Microsoft Research, Redmond, WA 98052. E-mail: firstname.lastname@example.org
Latent semantic indexing (LSI) is a statistical technique for improving information retrieval effectiveness. Here, we use LSI to assist in literature-based discoveries. The idea behind literature-based discoveries is that different authors have already published certain underlying scientific ideas that, when taken together, can be connected to hypothesize a new discovery, and that these connections can be made by exploring the scientific literature. We explore latent semantic indexing’s effectiveness on two discovery processes: uncovering "nearby" relationships that are necessary to initiate the literature-based discovery process; and discovering more distant relationships that may genuinely generate new discovery hypotheses.

Literature-based discovery uses the published scientific literature as a source of new discovery. First discussed by Swanson (1986) in connection with Raynaud’s disease, the problem can be characterized in this way: Beginning with the literature, R (for Raynaud’s), on some subject, can you identify the literature on another subject that helps in better understanding R, even though no one has ever thought that these two subjects were related?¹ In a series of papers, Swanson (1986a, 1986b, 1987, 1988a, 1988b, 1989a, 1989b, 1989c, 1990a, 1990b, 1991, 1993) showed this could be done, both by intensive reading and study, and by semiautomatic methods involving text analysis. Subsequently, Gordon and Lindsay (1996) replicated Swanson’s results and used other statistical methods to help automate the literature discovery process.

¹ Literature-based discoveries generate scientific hypotheses; conventional scientific research must be conducted if the hypothesis is to be confirmed.

As described by Swanson, there are two basic literature discovery processes. The first leads from the literature (R) associated with an initial topic to the literatures (I) of one or more related, intermediate topics. The second leads from one of these related topics to the literature (PD) associated with a potential discovery. Figure 1 illustrates these two steps (left to right).

We call these two processes identifying intermediate literatures and identifying potential discovery literatures, respectively (Fig. 1). Our interest is learning if latent semantic indexing (Deerwester et al., 1990), a statistical technique used with success in information retrieval, can help with either or both of these processes.

Identifying Intermediate Literatures

By definition, if we start with Raynaud’s and discover a brand new concept (cure, cause, treatment, or physiological process) never before reported, there will be no document that discusses both Raynaud’s and this new concept. But there may be a topic that is discussed along with Raynaud’s and is also discussed along with the new concept, even though no single article on this topic discusses both. A literature that serves as such a bridge is an intermediate literature.

Finding intermediate literatures, then, is a central problem in literature-based discovery. Of course, one can read about Raynaud’s and form impressions on that basis, but a systematic approach for identifying intermediate literatures would be more efficient and possibly more effective.

The following is an example of a MEDLINE record containing the term Raynaud’s (with slight cosmetic modifications to illustrate more plainly the record’s structure):
TITLE: Localized real-time blood flow measurements.
AUTHOR: van As H; Brouwers AA; Snaar JE
CITE: Arch Int Physiol Biochim 1985 Dec; 93(5): 87–
ABSTRACT: A novel method for real time, localized, flow measurements is applied to blood flow in human fingers. Results for arterial and venous flow in normal subjects and patients with abnormal blood circulation are presented. Effects of blood flow regulation by the autonomic nervous system have been observed. Stricture of the digital arteries could be clearly demonstrated in a patient with Raynaud’s phenomenon. Experimental signals due to pulsatile flow in a model system can be simulated in a quantitative way. The calibration, however, depends on the actual spin–spin relaxation time and the shape of the pulsatile flow vs. time curve. Due to these limitations, the volume flow rate can be measured with a relative error of approximately +/-25%. (AUTHOR)
MINOR TERMS: Fingers BS. Human. Nuclear Magnetic

Received January 31, 1996; revised April 30, 1997; accepted April 30,
© 1998 John Wiley & Sons, Inc.
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 49(8):674–685, 1998

Table 1. Four statistics used to identify intermediate literatures.

  Token frequency:   number of tokens^b of X within R
  Record frequency:  number of records in R containing X

  a Strictly, frequencies should be ratios, but the normalizing denominators in these statistics may be dropped since what is important is term (or phrase) rank orderings, which are identical with and without [...]
  b Token frequencies count each distinct occurrence of a word (or phrase). For instance, in the sentence "Row, row, row your boat gently down the stream," the token frequency for row is 3; for gently it is 1. On the other hand, the record frequency of both of these items is incre[...]

  For a term or phrase, X, these four statistics may be calculated in relation to the literature on Raynaud’s, R.
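The difference between token frequency and record frequency in Table 1 can be illustrated with a short sketch. It assumes a deliberately simple tokenizer (lowercasing and splitting on non-alphanumeric characters); the function names `tokenize` and `token_and_record_frequencies` are illustrative and are not taken from Gordon and Lindsay's software.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def token_and_record_frequencies(records):
    """Return (token_freq, record_freq) counters over a list of record strings."""
    token_freq = Counter()
    record_freq = Counter()
    for record in records:
        tokens = tokenize(record)
        token_freq.update(tokens)        # every occurrence of a word counts
        record_freq.update(set(tokens))  # a record counts at most once per word
    return token_freq, record_freq


# The one-sentence "record" used in the note to Table 1.
records = ["Row, row, row your boat gently down the stream"]
token_freq, record_freq = token_and_record_frequencies(records)
print(token_freq["row"], token_freq["gently"])    # token frequencies: 3 and 1
print(record_freq["row"], record_freq["gently"])  # record frequencies: 1 and 1
```

Two-word adjacency phrases could be counted the same way by pairing each token with its successor before updating the counters.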
Among other non-"noise" words, this record contains blood and flow (from the title), flow, blood, fingers, etc. (from the abstract), plus other words from the remaining MEDLINE record fields. Similarly, the two-word adjacency phrases in this MEDLINE record include localized real, real time, time blood, blood flow, flow measurements (from the title), novel method, real time, time localized, localized flow, flow measurements, blood flow, and blood circulation (from the abstract). Standard information retrieval techniques can eliminate from consideration nonsubstantive words, such as a, for, and is, and can use sentence punctuation to prevent the inclusion of false phrases such as fingers results (from the abstract).

Gordon and Lindsay (1996) have investigated automated processes, based on descriptive statistics similar to those used in information retrieval, for supporting the identification of intermediate literatures from MEDLINE records such as these. Specifically, to identify intermediate literatures related to the topic Raynaud’s, they downloaded the full MEDLINE records for all 1983–1985² documents that mention Raynaud’s, parsed them as described for the sample record, and then computed the statistics shown in Table 1 for every term and two-word adjacency phrase.

² This date range was the same one that Swanson used and supported Gordon and Lindsay’s replication of Swanson’s results by new methods.

For the MEDLINE record shown above, the word time had a token frequency of 4; localized had a token frequency of 2; and Raynaud’s had a token frequency of 1. Similarly, the phrase blood flow had a token frequency of 4, whereas blood circulation had a token frequency of 1. For this single MEDLINE record, the record frequency for each of these words and phrases is 1.

Table 2 gives an example of the four statistics that would be computed for the term (or phrase) X, which occurs both within and outside the Raynaud’s subset of MEDLINE.

Gordon and Lindsay (1996) used these statistics to try to identify intermediate literatures for further exploration. After calculating each of the four statistics for every term or two-word adjacency phrase in a downloaded literature (such as Raynaud’s), they identified the twenty (or thirty) items with the highest values for each statistic. They then considered each of these items to be a query that could be used to identify a different intermediate literature. Though the methods used were highly automated, the intended use of these methods was to provide support for a qualified medical researcher who could most effectively interpret and act upon the data provided.

Fig. 1. The two steps in literature-based discovery.

In examining the four separate lists of highest-ranked items, Gordon and Lindsay concluded that three of the statistics (token frequency, record frequency, and token frequency * inverse global record frequency (igf)) were extremely predictive of each other. If a particular term or phrase, such as blood, was among the top 20 positions on one of the lists, very likely it was among the top 20 of another list as well. As a specific example, in analyzing the Raynaud’s literature the four statistics were computed for each of the approximately 2,000 single-word terms that occurred at least four times in that literature. If the terms on the top 20 list for one statistic were statistically independent of those on another, only a fractional number of terms (on average 20 × 20 / 2,000 = 0.2) should appear on both lists. What was observed, instead,
was that the token frequency and record frequency lists had fifteen (of twenty) items in common; the token frequency and token frequency * igf lists had seventeen; and the record frequency and token frequency * igf lists had fifteen. In other words, an item’s appearance on the top 20 list for one statistic was highly correlated with its appearance on the top 20 list of the other two. The same conclusion held when the number of items per list was increased; when two-word adjacency phrases were considered rather than single-word terms; and when literatures other than Raynaud’s were examined.

Table 2. Fictitious numerical example showing calculation of four statistics for term X. (Sections of the table: document collection characteristics, by number of documents, for R = documents mentioning Raynaud’s and for the subset of R mentioning term (phrase) X; token characteristics, by number of token occurrences; and the value of each statistic, including tf * inverse global record frequency (tf * igf).)

There was not nearly the same degree of correlation between a term’s occurrence on the top 20 list for relative frequency and its occurrence on the top 20 list of another statistic. Again considering Raynaud’s as an example, the top 20 items sorted by relative frequency included one item in common with the top 20 token frequency items; one item in common with the top 20 record frequency items; and one in common with the top 20 token frequency * igf items. This pattern held for single words and two-word adjacency phrases, when the top n size was adjusted (to values other than 20), and when different literatures were examined.

Not only were the token frequency, record frequency, and tf * igf lists quite similar, but they were effective in uncovering intermediate literatures on a discovery path from Raynaud’s to fish oil. By looking at the very top items on any of the three lists, one was led from Raynaud’s (the starting point) to the topic blood. Then, by downloading and analyzing the literature on the topic blood AND Raynaud’s, one was led directly by any of the three statistics to the topic blood viscosity (see Fig. 2). Blood viscosity is indeed an intermediate, or "bridge," literature: It is mentioned in the Raynaud’s literature and is clearly accepted scientifically as being related to Raynaud’s. It is also mentioned in the fish oil literature, and is scientifically related to that as well. Indeed, there are physiological connections implicating fish oil as a treatment for Raynaud’s, including that fish oil reduces blood viscosity and that increased blood viscosity is one of the reasons Raynaud’s patients suffer symptoms associated with peripheral blood deficiency. Despite this, the hypothesis that Raynaud’s might be treated by fish oil lay dormant in the literature until Swanson (1986a, 1986b, 1987) uncovered it by methods of literature-based discovery.

Fig. 2. Raynaud’s and two intermediate literatures.

To summarize, Gordon and Lindsay (1996) demonstrated three statistics that were useful for uncovering intermediate literatures to support literature-based discovery: token frequency, record frequency, and token frequency * inverse global record frequency. Each of them separately rank-ordered large lists of terms (and phrases) in quite similar ways. And from the starting point (Raynaud’s in this example), a medical researcher could be led first to blood, and then to blood viscosity, by any of these three statistics (Fig. 2). Gordon and Lindsay argued that an effective method for identifying an intermediate literature is finding one with strong conceptual similarity to the starting point and that each of the three correlated statistics can serve this purpose, since each has lexical prominence in the Raynaud’s literature.

Fig. 3. Decomposition of term by document matrix.

Latent semantic indexing (Deerwester et al., 1990) offers an entirely different way potentially to identify intermediate literatures and, thus, to support literature-based discovery. A standard term by document matrix, D, is mathematically equivalent to the product of three other matrices, as shown in Figure 3. M is a matrix of singular values computed by a "factoring" process (singular value decomposition; Forsythe et al., 1977)
that expresses each of the t original indexing terms and also each of the d original documents as a vector of m factors (where m is the number of linearly independent rows, and columns, in D). Technically and intuitively, each of the original indexing terms is now expressed as a vector of statistically independent factors (and represented by a row of the Terms matrix); each document is similarly represented by a column of the Docs^T matrix. In other words, by means of singular value decomposition, terms and documents are represented in the same space.

The great benefit of representing D as a product of three matrices is that we can consider a representational space containing just the k < m most important of these dimensions, for k of any size. We can then approximate D:

\[ D \approx \hat{D} = \mathrm{Terms} \times M \times \mathrm{Docs}^{T} \]

where Terms is t × k; M is k × k; and Docs^T is k × d.

The result is an optimal reduced dimensional approximation of D (by a criterion of least squares). Practically, this means that two documents that use strongly overlapping vocabulary may both be retrieved even if a particular query only uses the terms that index one of them. Similarly, terms will be considered "close" to each other if they occur in overlapping sets of documents.

Figure 4 suggests the way latent semantic indexing assists in information retrieval, using term co-occurrences to give support for document similarity. Pretend that the three documents shown are part of a larger collection where term-a and term-b tend to be used together in indexing documents, as do term-b and term-c. Then, the query term-b may still retrieve Doc-1, even though Doc-1 is not indexed by that term. Similarly, the query term-c may retrieve Doc-1 by virtue of "transitive" co-occurrence. In other words, term-c co-occurs often with term-b, which co-occurs with term-a. This gives support for retrieving Doc-1 for the query term-c. This is the ordinary spirit in which latent semantic indexing is used: to find similarity among documents based on their indexing, and thus retrieve documents that do not exactly match a query.

Fig. 4. Doc-x indexed by term-y is represented by X → Y.

However, with equal applicability, latent semantic indexing can uncover relationships among terms. For instance, the terms term-a and term-b demonstrate semantic similarity by occurring together in Doc-2. Similarly, term-a and term-c will bear a transitive, but measurable, similarity to each other when a collection like the above is represented by means of latent semantic indexing.

This latter perspective suggests that, perhaps, latent semantic indexing provides an alternative approach to uncovering intermediate literatures. Specifically, if terms such as Raynaud’s are thought to stand for underlying concepts (the concept Raynaud’s disease), then we can see which terms lie near each other in LSI-space and, thus, make inferences about conceptual similarity.

To test the usefulness of this approach, we began with the 560 documents published during the years 1983–1985 containing mention of the term Raynaud’s, the same documents used by Gordon and Lindsay and by Swanson. LSI scaling was then performed on this set of documents, and the top 100 factors were retained (k = 100). Each document, as well as each term used in any document, was thus represented as a vector in the same 100-dimensional space.

A central interest of ours was to determine if this method produced substantially different (possibly better) results than Gordon and Lindsay’s method of selecting intermediate literatures on the basis of token counts, record counts, and tf * igf statistics.

Table 3. The 50 nearest neighbors to Raynaud’s by LSI that were also identified by Gordon and Lindsay’s statistical methods.

A fairly crude measure of the similarity between the two methods of generating items associated with Raynaud’s is to consider their overlap. To do this, a single list of items representing the "best" intermediate items by Gordon and Lindsay’s method was developed by taking the union of six lists:

• the top 40 terms, according to record counts
• the top 40 terms, according to token counts
• the top 40 terms, according to tf * igf
• the top 40 two-word phrases, according to record counts
• the top 40 two-word phrases, according to token counts
• the top 40 two-word phrases, according to tf * igf

This union contained 136 unique items (an item being either a term or a two-word phrase).
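The LSI steps described above (factor D, keep the k largest factors, then compare terms by cosine) can be sketched with NumPy. This is only an illustration under stated assumptions: the tiny matrix and its term labels are invented, the helper names `truncated_svd`, `cosine`, and `nearest_terms` are hypothetical, and scaling term vectors by the singular values is one common convention rather than a detail confirmed by the text.

```python
import numpy as np

# Invented toy term-by-document matrix D (t = 4 terms, d = 5 documents).
TERMS = ["raynauds", "blood", "viscosity", "fingers"]
D = np.array([
    [1.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 0.0, 1.0],
])


def truncated_svd(D, k):
    """Factor D and keep the k largest factors: Terms (t x k), M (k x k), DocsT (k x d)."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def nearest_terms(D, terms, target, k):
    """Rank every other term by its cosine to `target` in the k-factor space."""
    Terms, M, _ = truncated_svd(D, k)
    vectors = Terms @ M  # one row per term, scaled by the singular values
    t = terms.index(target)
    scored = [(cosine(vectors[t], vectors[i]), terms[i])
              for i in range(len(terms)) if i != t]
    return sorted(scored, reverse=True)


Terms, M, DocsT = truncated_svd(D, k=2)
D_hat = Terms @ M @ DocsT  # rank-2 least-squares approximation of D
for score, term in nearest_terms(D, TERMS, "raynauds", k=2):
    print(f"{term}: {score:.3f}")
```

On a real collection the matrix would be large and sparse, so a sparse truncated solver would be the more practical choice; the dense call here is only meant to make the three-matrix product visible.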
The 136 nearest neighboring terms to the term Raynaud’s according to the LSI analysis were then identified. This was done by rank-ordering all terms by their cosine to the term Raynaud’s. The size of the intersection of the two lists of 136 items was 57 items (approximately 42%). The best LSI-ranked items (i.e., those with lowest ranks) were most likely to be in the Gordon and Lindsay list. Table 3 shows all of the top 50 LSI-ranked terms that also appeared in the Gordon and Lindsay list. Practically every item very close to the term Raynaud’s in LSI space was identified by the Gordon and Lindsay methods. In particular, of the top 10 items nearest Raynaud’s according to LSI, Gordon and Lindsay’s methods identified nine. Of the top 20 nearest items (by LSI methods), 15 were identified by Gordon and Lindsay’s methods; of the top 30, 21; of the top 40, 27; and of the top 50, 31 (Fig. 5). Further, in just examining the very highest-ranked items (those that each method recommends most strongly as an intermediate literature), we find that each of the top 10 from Gordon and Lindsay is among the top 12 items by LSI. In other words, these two lists’ top 10 items are nearly permutations of each other. In addition, the very highest items in one list tend to be right at the top of the other.

Fig. 5. Percentage of top N LSI items identified by Gordon and Lindsay.

In fact, a more sensitive analysis was conducted to test for a correlation between the top LSI rank positions and the top Gordon and Lindsay ranks. Since the Gordon and Lindsay list was the union of six separate lists and an item could come from one or more of them, it would not have a unique rank across different lists. Arbitrarily, then, we selected the Gordon and Lindsay two-word phrase list ranked by record counts to provide ranks for use in our analysis. These 40 items were Spearman rank-correlated with the 40 highest-ranking two-word phrases identified by LSI scaling (retaining k = 200 factors). A two-word phrase that occurred in one list but not in the other was assigned a rank of 41 in the list in which it did not appear. The null hypothesis tested was that the top 40 ranks of the Gordon and Lindsay and the LSI lists were uncorrelated. Data and results are shown in Tables 4 and 5.

By this analysis, we can conclude that the top 40 Gordon and Lindsay two-word phrases (by record counts) are rank-correlated with those found by LSI (even if assigning ranks of 41 to items not appearing on a list may suppress slightly the effects of outlying ranks). More simply, an item’s approximate position on the Gordon and Lindsay list (whether near the top, in the middle, or near the bottom) will predict its approximate position on the LSI list.

Of course, other methodological approaches could be taken to compute rank correlations, including forming an "average rank" for each Gordon and Lindsay two-word phrase (based on its three separate ranks). However, because the three statistics Gordon and Lindsay used to determine intermediate literatures were so strongly correlated, this is unlikely to affect our finding in any appreciable way.

One surprising observation from Table 4 deserves a comment. The phrase double blind is the best-ranked phrase in LSI but is not among the top 40 items from the Gordon and Lindsay analysis (it had rank 45, occurring in 11 records). A possible explanation is that the term Raynaud’s lies near the phrase double blind in MEDLINE. More likely, the prominence of double blind may be somewhat coincidental and actually result from the fact that the phrase occurred in just 11 of the 560 Raynaud’s documents analyzed (14 times in total), but was near
Raynaud’s by tending to co-occur with all of its chief [...]

Table 4. Top 40 phrases and ranks by LSI and Gordon and Lindsay.

Table 5. Spearman rank correlation for LSI and Gordon and Lindsay.

What conclusions do these various analyses suggest? Principally, that there is a strong overlap among the terms uncovered by LSI scaling and by the Gordon and Lindsay techniques, and that this overlap is strongest among the very-top-ranked items by each method. Gordon and Lindsay have argued that the best terms for identifying intermediate literatures are those very close (semantically and statistically) to the starting point. By this argument, the two methods may provide similar, but complementary, approaches for identifying intermediate concepts.

In the next section, we change our focus and discuss the use of LSI for identifying potential discovery literatures.

Identifying Potential Discovery Literatures

A connotation underlying the phrase latent semantic indexing is that hidden relationships among concepts exist and, further, that they may be teased out statistically. Figure 4 has already illustrated how the concept identified by term-c may bear a latent relationship to the concept identified by term-a because both terms co-occur with term-b.

Is it possible, then, that LSI can form a bridge that connects two bibliographically isolated literatures? From Swanson’s work (1986a, 1986b, 1987), for example, we know that the concept blood viscosity is scientifically related to both Raynaud’s and fish oil, but that the Raynaud’s and fish oil literatures do not refer to each other, nor are they mentioned together by other documents.

Suppose, however, that one had conjectured that blood viscosity is an important intermediate literature linking Raynaud’s and some unknown cause, cure, or treatment for this condition. This conjecture may have been stimulated by using a literature-based discovery tool, or it may have arisen simply by reading and thinking about Raynaud’s. Figure 6 may help make this clearer. The suggestion is that a concept³ that is related to blood viscosity but not directly to Raynaud’s may be uncovered through [...]

³ Implicitly, we are assuming that a term used in text represents the concept with the same name. Accordingly, the term Raynaud’s would [...]

Fig. 6. Blood viscosity as intermediate literature.

Since blood viscosity is conjectured to be a bridge to some unknown discovery, it can be the focus for LSI scaling. Selecting the blood viscosity literature to perform LSI processing on would certainly appear to be an advantage in finding hidden connections to Raynaud’s since blood viscosity is, in fact, a bridge to a hidden treatment (fish oil). To test the effectiveness of this use of latent semantic indexing, we proceeded as follows. The 809 MEDLINE records published between 1983 and 1985 and mentioning blood viscosity were downloaded and LSI-processed (retaining the k = 100 most important factors). A list of closest neighbors to the term Raynaud’s was then constructed according to their cosine to Raynaud’s, but no element on the list could appear in any of the 560 Raynaud’s documents from the same period. In other words, we constructed a list of terms that were "near" Raynaud’s (from the perspective of blood viscosity) but were nonetheless bibliographically disjoint from it. The items on this list would certainly seem worthy of further study.

A specific interest was whether the phrase fish oil would appear prominently on this list. More generally, we wanted to see which terminal concepts contained in the list might suggest a new discovery about Raynaud’s. By way of an example, a substance such as aspirin was considered a terminal in the sense that it can be considered a possible cause, cure, or treatment for Raynaud’s. A topic such as tissue hypoxia (hypoxia means a decrease in normal levels of oxygen) might be related to Raynaud’s but cannot be considered a terminal concept by the same lines of reasoning; thus it was excluded from our list. The list of all Raynaud’s neighbors that were bibliographically disjoint from Raynaud’s and had a cosine (to Raynaud’s) of 0.005 or more was examined by hand to remove nonterminals.⁴ Table 6 shows the items that remained.

⁴ Currently, automatic processing of text is incapable of determining terminal concepts. Thus, identification of terminals must be conducted by hand. This manual step does not diminish our approach, whose objective is to support hypothesis discovery, not automate it. Terminals can rapidly be selected from lists of terms and phrases, especially by [...]

Table 6. Terminal concepts identified as Raynaud’s neighbors plus their MEDLINE intersections with Raynaud’s.

Ignoring the final column, each row in the table shows the value of cosine(Raynaud’s, terminal term); a terminal term’s "Rank in LSI, non-Raynaud Terms," which only considers terms appearing in blood viscosity documents but not appearing in a Raynaud’s document; and a terminal term’s "Rank in LSI space, all Terms," which tells how many terms had a larger cosine to Raynaud’s, including all blood viscosity terms and two-word phrases (in any of the 809 blood viscosity documents). For instance, 149 terms had a larger cosine than the term hydroxychloroquine, but hydroxychloroquine’s rank of six among non-Raynaud’s items means that there were only five higher-ranked items, each judged a nonterminal, that appeared in blood viscosity documents but not in Raynaud’s documents, including the items viscosities and motor activity. Notice that hydroxychloroquine is the only terminal term in Table 6 that has a cosine value above 0.10.

By definition, none of the terms in Table 6 appeared in any of the 560 Raynaud’s documents from 1983–1985; the three-year time span was chosen to correspond as closely as possible to the documents Swanson (1986a, 1986b, 1987) examined in his Raynaud’s studies. It is possible, of course, that some of the terms in Table 6 occurred along with the term Raynaud’s before 1983. Because we are, in effect, investigating Swanson’s literature-based discovery of the Raynaud’s–fish oil connection, we can ignore co-occurrences after 1985. So we queried MEDLINE to determine the number of documents containing both the term Raynaud’s and each one of the terms in Table 6 in any year before 1986. Results are shown in the last column of the table. This column indicates which terminal terms we can rule out as possible discoveries by virtue of having been already discussed with Raynaud’s (those with a nonempty intersection).

In regard to our effort to discover directly the Raynaud’s–fish oil connection, the results are disappointing. Fish oil is on the list of nonintersecting items, but is nowhere near the term Raynaud’s (being its 1961st closest neighbor) and still behind almost 600 other terms that appear in the blood viscosity, but not Raynaud’s, literature. However, eicosapentaenoic acid, the active agent in fish oil, fares much better, being the 208th-ranked newly uncovered item in relation to Raynaud’s, and the fifth-ranked terminal when additional MEDLINE searching has ruled out terms and phrases that were mentioned along with Raynaud’s in articles written before 1983. But its cosine, at 0.016, is a faint signal.

Two other items with null intersection to Raynaud’s may deserve further study if experts in the field should confirm their merit. Calcium dobesilate has been used for vascular diseases and diabetic retinopathy, among other conditions. It has been shown to reduce blood viscosity, improve venous insufficiency, and reduce platelet deposition. Niceritrol has been used to treat hyperlipidemia, and, in addition, it has beneficial effects on blood viscosity and platelet aggregation. The effects of these drugs are related to treating Raynaud’s.

It is interesting to note, too, that among the items in Table 6 which have nonempty intersection with Raynaud’s are substances such as isoxsuprine and dextran, which have been used to treat Raynaud’s. In addition, some of the nonterminal, but nonintersecting, items produced by the analysis suggest possible avenues to examine in connection with Raynaud’s. For instance, lysolecithin, an acid formed by an enzymatic process in the blood, is capable of breaking up red blood cells and thus may prove useful in treating Raynaud’s. This conjecture, too, can be appropriately evaluated by medical experts.

Table 7. Poisson approximations of size of topic in sample.

A variation on this approach to finding new connections to Raynaud’s is to find the nearest neighbor to the
centroid of all documents comprising the Raynaud’s liter- ature — instead of to the term Raynaud’s. The centroid of a cluster of items is its central value, and is computable in a variety of ways. Experiments showed that considering Raynaud’s to be a document centroid, rather than a term, Directly Identifying Potential Discovery
It is interesting to note that, since Raynaud’s and fish oil truly are medically related, and since this relationship can be detected by other methods, LSI does not directly uncover this latent association, especially since LSI scaling was performed on the blood viscosity literature, which [...]

The problem may be one of scale. By analogy, a glance at a globe suggests that New York City and Boston are near each other. But they are anything but neighbors when considering only the northeast seaboard of the United States. The same may be true of the Raynaud’s–fish oil association. In the broad context of medicine, these concepts clearly are linked by the bridge of blood viscosity. Nevertheless, blood viscosity (used as the focus for LSI processing) may be an improper vantage from which to detect the association. We may need to “back up” to gain some perspective, just as we can only see that Boston and New York City are near each other when our perspective is the globe.

Size of MEDLINE (1980–1985) = 780,000; S = sample size = 18,499; p = topic base rate in MEDLINE = n/780,000.
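As a rough check on the sampling figures quoted in this section, the following sketch applies a Poisson approximation with rate λ = S · p, using S = 18,499 and p = n/780,000 from the table note. The function names are hypothetical and chosen only for illustration; this is a minimal sketch under those stated values, not the procedure used to build the original table.

```python
import math

MEDLINE_SIZE = 780_000  # MEDLINE records for 1980-1985, per the table note
SAMPLE_SIZE = 18_499    # S, the number of documents in the sample


def expected_in_sample(topic_size: int) -> float:
    """Poisson rate: lambda = S * p, where p = topic_size / MEDLINE_SIZE."""
    return SAMPLE_SIZE * topic_size / MEDLINE_SIZE


def prob_fewer_than(topic_size: int, k: int) -> float:
    """P(X < k) for X ~ Poisson(lambda): fewer than k topic documents sampled."""
    lam = expected_in_sample(topic_size)
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))


def prob_at_least(topic_size: int, k: int) -> float:
    """P(X >= k): at least k documents on the topic appear in the sample."""
    return 1.0 - prob_fewer_than(topic_size, k)


if __name__ == "__main__":
    # Topic sizes 50 and 200 are the two cases discussed in this section.
    for n in (50, 200):
        print(n, round(prob_fewer_than(n, 1), 3), round(prob_at_least(n, 2), 3))
```

Under these assumptions, a topic of 50 documents has roughly a 30.5% chance of being absent from the sample, and a topic of 200 documents has roughly a 95% chance of contributing at least two documents.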
An experiment was conducted to explore this possibility by attempting to identify a potential discovery literature without first selecting an intermediate literature. In principle, we desired to analyze all of MEDLINE from the period 1980–1985 (nearly 780,000 records). For practical and computational reasons, this was not possible. Instead, we tried to obtain an approximately random sample of MEDLINE from the given date range. We did this by obtaining all MEDLINE records written in English, containing an abstract and at least one reference, and with a publication date between 1980 and 1985. By doing so, 18,499 records were identified and downloaded for processing (a sample of about 2.5% for the period). It is possible that including only English-language items in the sample may have introduced some bias, for instance in the areas of pharmacology, where different areas of the world have approved different drugs for the same illness. This concern is reduced by noting that research performed in Europe and elsewhere around the world has a significant representation in the sample, since much scientific publication is in English. It is also possible, though unlikely, that the constraint that all records contain an abstract and reference(s) distorted the sample in some unintended fashion. Of course, the size of this sample means that some very small topics were likely excluded from it. For instance (see Table 7), if there are only 50 documents about a given topic in MEDLINE during the period 1980–1985, there is an approximately 30.5% chance that it will not be represented in the sample of 18,499 documents drawn. On the other hand, by the time a topic is of size 200, there is a 95% chance that the sample will contain at least two documents on that topic. All told, the method of sampling used likely provided a fair approximation to a genuinely random sample.

LSI processing proceeded along the lines already described: The set of documents and terms was represented by k = 300 orthogonal factors (as opposed to 100 in the previous experiment) to adjust for the larger collection size. In this space, there were just over 36,000 terms or phrases that were not among those mentioned in the 1983–1985 Raynaud’s document collection. From these new items, a list of the 1,000 closest neighbors to Raynaud’s was generated. When we then hand-selected terminals from this list, we obtained a list of 37 items.

To ensure that an item did not occur in a Raynaud’s document earlier than 1983, we consulted the entire MEDLINE document collection to find the number of documents published any time before 1986 that used both that item and the term Raynaud’s. The cosine, rank, and intersection data for the hand-selected terminal items are [...]

Terminal concepts identified as Raynaud’s neighbors.

* Items with two values, like 0 (13) for diltiazem hydrochloride, show (1) the size of the intersection with Raynaud’s of the entire phrase (diltiazem hydrochloride ∩ Raynaud’s = 0); and (2) the size of the intersection of its chief chemical constituent (diltiazem ∩ Raynaud’s = 13).

Among the list of items in Table 8 are those with already known connections to Raynaud’s, including methysergide, hydralazine, and isoxsuprine. Although these cannot be considered discoveries, their inclusion reinforces the idea that LSI processing can help detect possible treatments for Raynaud’s when a broad, unfocused literature (a random subset of MEDLINE) is processed without the benefit of a predefined connection, such as blood viscosity, to link them. Among the items in Table 8 are also substances never used before to treat Raynaud’s that may deserve exploration as Raynaud’s treatments if they were to pass the review of experts in medical therapeutics. These include vasodilating agents, such as perhexiline, diltiazem hydrochloride, nylidrin, and lidoflazine; drugs for treating ischemia, i.e., insufficient blood flow, such as dihydropyridine derivatives, including nitrendipine, gallopamil, nisoldipine, and bepridil; and antihypertensive drugs such as diazoxide, captopril, and [...]

We emphasize again that these analyses and all others we have shown support, but do not automate, discovery, and that their appropriate interpretation should come from medical researchers familiar with the topic. For instance, several of the drugs mentioned in Table 8 are calcium channel blockers; and the nonterminal phrase calcium blocking has a very high cosine (0.573). So, without additional evidence to the contrary, a possibility is that calcium channel blockers may be effective in the treatment of Raynaud’s, and the nonterminal concept, calcium channel blocking, could itself be analyzed as an intermediate literature (its literature downloaded, parsed, and statistics computed) in the search for terminals disjoint from Raynaud’s. On the other hand, research pharmacologists familiar with calcium channel blockers might know, for example, that those that affect peripheral blood flow (such as nifedipine) have already been tested as treatments for Raynaud’s, whereas the calcium channel blockers in Table 8 affect the heart, thus making them ineffective as Raynaud’s treatments. So for those with the requisite background knowledge, the computed statistics should help stimulate useful conjectures that may lead to the [...]

None of the terms in Table 8 with empty intersection with Raynaud’s was among the list of nonintersecting terms in Table 6 (the equivalent table for the previous experiment, where the blood viscosity literature was LSI-processed). In fact, only one term, isoxsuprine, was common to both tables even when we consider both terms with empty and nonempty intersection with the term Raynaud’s. As we suspected, a MEDLINE focus for LSI has certainly produced a different set of Raynaud’s near neighbors than did a blood viscosity focus. In this sense, LSI processing of MEDLINE to search for potential discoveries directly is another tack to consider in attempting to uncover latent medical discoveries.

We considered a variation of this method in an attempt to adopt the MEDLINE focus for LSI processing while retaining some of the advantages of considering blood viscosity an intermediate literature: We restricted the list of items in Table 8 to those that were both bibliographically disjoint from Raynaud’s and present in the blood viscosity literature. Only three items met these criteria: methyldopa ad, methoxamine, and captopril ad. However, in looking at articles on methyldopa and captopril, we learned that both had been studied as a treatment for Raynaud’s. The reason for this apparent contradiction is that the phrases identified, methyldopa ad and captopril ad (where ad is a MEDLINE subheading meaning “administration and dosage”), were not used in the Raynaud’s literature, even though both of these drugs had been written about without the ad subheading. Methoxamine causes vasoconstriction and, as such, would be contraindicated [...]

Summary and Discussion

Our investigation suggests that latent semantic indexing might be a useful tool in literature-based discovery. Because of the difficulty of the task, literature-based discovery may be totally unsuccessful for certain problems, or by certain methods. LSI provides another technique that can be considered in looking to uncover hidden [...]

We have shown that latent semantic indexing might be a useful technique in either of the two phases of literature-based discovery. During the search for intermediate literatures, it fairly closely reproduces (but extends) the same set of highly ranked terms and phrases that Gordon and Lindsay (1996) have shown are a useful starting point for literature-based discovery. In helping identify potential discovery literatures, LSI can be used in either of two ways: by factoring a set of documents associated with a suspected intermediate literature, or by analyzing the larger literature (MEDLINE, in this study) that forms the universe of discourse. In studying new discoveries in connection to Raynaud’s disease, the first method was able to identify fairly prominently a chief chemical constituent (eicosapentaenoic acid) in fish oil using the literature from a time when the healthful effects of fish oil on Raynaud’s were unknown. The phrase fish oil was not nearly as prominent. The second method revealed a very [...]

It is important to remember that tools and analyses like those we have described in this paper support, but do not in any way replace, scientists. The skilled scientist may see patterns in data like those we report that derive from his or her knowledge of the field. One scientist may see, for example, that a particular class of drugs is prominently represented in the data and begin to form hypotheses about this drug class’s ability to treat Raynaud’s. A pharmacologist with a more complete background in the area may know that certain of these drugs are primarily known for their effects on the heart, rather than the peripheral vascular system. This type of knowledge could help isolate the drugs that truly merit scientific investigation by suggesting a more focused analysis.

The premise behind literature-based discovery support is that medical specialization makes it virtually impossible for a scientist to stay abreast of developments in areas outside his or her area of direct interest. As a consequence, important connections crossing disciplinary boundaries may never be noticed. Literature-based discovery support tools can help organize the knowledge of scientific fields that lie outside a scientist’s direct specialization, thus improving his or her ability to organize and make use of [...]

LSI is one tool that may help in this effort. Additional research is needed to provide a broader array of tools. Among other tools that we are investigating are those for: (1) reporting data at several levels of abstraction (e.g., counting as statistical evidence for calcium channel blockers any drug that is in this drug family); (2) looking for evidence suggestive of “causal” relationships in the literature (which may be revealed independently of their statistical prominence); and (3) using semantic and category knowledge to improve the step of identifying terminal concepts, which is now a completely intellectual process. Through these efforts, we hope to provide scientists methods that support their efforts to generate discovery hypotheses that lie latent in the published literature.

Acknowledgments

This research was conducted while Michael Gordon was on sabbatical at Bellcore. He thanks Tom Landauer and Michael Lesk for that opportunity. This work benefitted from discussions with Tom Landauer, George Furnas, Jeff Zacks, Robert Lindsay (University of Michigan), and Don Swanson (University of Chicago). The authors also thank the anonymous referees for their careful reviews and their help in strengthening the medical and methodological arguments contained in the paper.

References

Deerwester, S., et al. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391–[...]

Forsythe, G. E., Malcolm, M. A., & Moler, C. B. (1977). Computer methods for mathematical computations (chapt. 9). Englewood [...]

Gordon, M. D., & Lindsay, R. K. (1996). Toward discovery support systems: A replication, re-examination, and extension of Swanson’s work on literature-based discovery of a connection between Raynaud’s and fish oil. Journal of the American Society for Information [...]

Swanson, D. R. (1986a). Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30, 7–[...]

Swanson, D. R. (1986b). Undiscovered public knowledge. Library [...]

Swanson, D. R. (1987). Two medical literatures that are logically but not bibliographically connected. Journal of the American Society for Information Science, 38, 228–233.

Swanson, D. R. (1988a). Migraine and magnesium: Eleven neglected connections. Perspectives in Biology and Medicine, 31, 526–557.

Swanson, D. R. (1988b). Unnoticed connections in the literature of medicine: Implications for knowledge representation and natural-language searching. 1988 ASIS Mid-Year Meeting, Ann Arbor, MI.

Swanson, D. R. (1989a). A second example of mutually isolated medical literatures related by implicit unnoticed connections. Journal of the American Society for Information Science, 40, 432–435.

Swanson, D. R. (1989b). Online search for logically related noninteractive medical literatures: A systematic trial and error strategy. Journal of the American Society for Information Science, 40, 356–358.

Swanson, D. R. (1989c). Medical literatures as a source of new knowledge. USDE Final Report, Dec. 1989.

Swanson, D. R. (1990a). Somatomedin C and Arginine: Implicit connections between mutually isolated literatures. Perspectives in Biology and Medicine, 33, 157–186.

Swanson, D. R. (1990b). Medical literature as a potential source of new knowledge. Bulletin of the Medical Library Association, 78, 29–37.

Swanson, D. R. (1991). Complementary structures in disjoint science literatures. Proceedings of the Fourteenth Annual International ACM SIGIR Conference (pp. 280–289).

Swanson, D. R. (1993). Intervening in the life cycles of scientific knowledge. Library Trends, 41, 606–631.