Entity Lookup in Apache Stanbol just got much faster
I committed the first version of the FST linking engine. This engine implements entity linking functionality based on Lucene FST (Finite State Transducer) technology, which allows it to perform the label-based entity lookup fully in memory. Only entity-specific information (URI, labels, types and ranking) for tagged entities needs to be loaded from disc (or retrieved from an in-memory cache).
This engine does not use Lucene's FST API directly, but re-uses the OpenSextant SolrTextTagger module implemented by David Smiley.
To give users some idea of how efficiently FSTs can hold information, here are the statistics for the FST models required for entity linking against [Freebase](http://freebase.com):
- Number of Entities: ~40 million
- FST for English labels: < 200MByte
- FSTs for the other major languages are all < 20MByte
- FSTs for all ~200 used language codes together are about 500MByte
That means that multilingual in-memory entity linking against Freebase can be done with 500MByte of RAM!
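The compactness comes from the FST sharing common prefixes (and suffixes) across the millions of labels. As a rough illustration of the in-memory label lookup idea, here is a toy prefix trie; this is not Lucene's actual FST implementation, and the labels and entity IDs below are made up:

```python
# Toy prefix trie mapping entity labels to entity IDs, kept fully in memory.
# Lucene's real FST additionally shares suffixes and encodes arcs compactly,
# which is what lets ~40 million Freebase labels fit in < 200MByte.

class LabelTrie:
    def __init__(self):
        self.root = {}

    def add(self, label, entity_id):
        node = self.root
        for ch in label:
            node = node.setdefault(ch, {})
        # store the payload under a reserved key (None is never a character)
        node.setdefault(None, []).append(entity_id)

    def lookup(self, label):
        node = self.root
        for ch in label:
            if ch not in node:
                return []
            node = node[ch]
        return node.get(None, [])

trie = LabelTrie()
trie.add("Paris", "fb:m.05qtj")         # hypothetical Freebase IDs
trie.add("Paris Hilton", "fb:m.0152x_")
trie.add("Parma", "fb:m.0c7zf")         # shares the "Par" prefix with the above

print(trie.lookup("Paris"))         # ['fb:m.05qtj']
print(trie.lookup("Paris Hilton"))  # ['fb:m.0152x_']
print(trie.lookup("London"))        # [] - not in the model
```

Labels with shared prefixes share trie nodes; an FST goes further and also merges shared endings, which is why whole-language label sets compress so well.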
The engine is currently not included in the default build, as one of its dependencies (version 1.2 of the SolrTextTagger) is not yet released. To test it you will need to go to `enhancement-engines/lucenefstlinking` and follow the steps described in the README.md.
The README.md also provides details on how to configure the Solr index used with the engine and the engine itself.
Performance characteristic changes (over the current EntityLinking engine):
Most important: with the FST linking engine, the matching of entity labels against occurrences in the text is done fully in memory; no disc IO is needed for that part. The current EntityLinkingEngine does the same by using Solr queries.
However, the FST linking engine gets int Lucene document IDs as the result of the linking process. It therefore needs to load the linking-relevant information for those IDs (URI, labels, types and rankings) from the Solr index, which does require disc IO. To reduce the impact of this, the FST linking engine includes an LRU cache over that information. The EntityLinking engine gets that information "for free" in the result lists of its Solr queries.
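The caching idea can be sketched as follows. This is a minimal sketch, not Stanbol's actual (Java) implementation, and `load_from_index` is a hypothetical stand-in for the Solr document lookup:

```python
from collections import OrderedDict

class EntityInfoCache:
    """LRU cache over entity information, keyed by Lucene document ID.

    On a hit, the entity info (URI, labels, types, ranking) is served from
    memory; only misses fall through to the disc-IO-bound index lookup.
    """

    def __init__(self, max_size, load_from_index):
        self.max_size = max_size
        self.load_from_index = load_from_index  # doc_id -> entity info
        self.cache = OrderedDict()

    def get(self, doc_id):
        if doc_id in self.cache:
            self.cache.move_to_end(doc_id)      # mark as recently used
            return self.cache[doc_id]
        info = self.load_from_index(doc_id)     # disc IO happens here
        self.cache[doc_id] = info
        if len(self.cache) > self.max_size:
            self.cache.popitem(last=False)      # evict least recently used
        return info

# Hypothetical index lookup that records how often disc IO is needed.
io_calls = []
def load_from_index(doc_id):
    io_calls.append(doc_id)
    return {"uri": f"fb:doc/{doc_id}", "ranking": 0.5}

cache = EntityInfoCache(max_size=2, load_from_index=load_from_index)
cache.get(1); cache.get(2); cache.get(1)  # second get(1) is a cache hit
cache.get(3)                              # evicts doc 2 (least recently used)
print(io_calls)  # [1, 2, 3] - only three index reads for four lookups
```

Since the same popular entities tend to be tagged over and over, even a moderately sized cache absorbs a large share of the lookups.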
So to sum up: while the EntityLinking engine spends about 95% of its time executing Solr queries, the FST linking engine spends most of its time loading entity information from disc.
Initial Performance Tests:
I performed a test on my MacBook Pro (Core i7 2.6GHz, SSD), sending 5k DBpedia long abstracts with 10 concurrent threads via the Enhancer Stress Test Tool to chains that included language detection; OpenNLP token, sentence and POS tagging; and
- (A) FST linking engine configured for Freebase with a Document Cache size of 1 million vs.
- (B) the EntityLinking engine, also configured for Freebase.
- (A) average of 70ms for FST linking (with 100% CPU)
- (B) average of 390ms for EntityLinking
When doing the test with ProperNoun linking deactivated (basically also linking common nouns, to simulate longer texts), I got the following results:
- (A) average of 267ms for FST linking (with 100% CPU)
- (B) average of 1417ms for EntityLinking
In both cases the FST linking engine is about five times faster than the currently used EntityLinking engine.
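For reference, the exact speedup factors behind that summary, computed from the averages above:

```python
# Average request times from the tests above, in milliseconds.
proper_nouns_only = {"fst": 70, "entity_linking": 390}
common_nouns_too  = {"fst": 267, "entity_linking": 1417}

for name, times in [("ProperNoun linking", proper_nouns_only),
                    ("CommonNoun linking", common_nouns_too)]:
    speedup = times["entity_linking"] / times["fst"]
    print(f"{name}: {speedup:.1f}x faster")
# ProperNoun linking: 5.6x faster
# CommonNoun linking: 5.3x faster
```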