Full-Text Search and Indexing

Simol supports integrated full-text searches of SimpleDB domains using Lucene.NET. To get started indexing a domain you need to do the following things:
  1. Mark one or more string properties of your data class with IndexAttribute.
  2. Mark a DateTime property on your data class with the VersionAttribute.
  3. Set the value of SimolConfig.Indexer.IndexRootPath to the path where index files should be stored.
  4. Register your item mapping with an instance of IndexBuilder.
  5. Start the IndexBuilder to kick off the indexing process.
  6. Update the version property to the current time each time you insert or update the contents of an indexed property.

Let's walk through a code example. First we'll create an Address class and mark all string properties with IndexAttributes and the ModifiedTime property with a VersionAttribute:

    public class Address
    {
        [ItemName]
        public Guid Id { get; set; }

        [Index]
        public List<string> StreetAddresses { get; set; }

        [Index]
        public string Zipcode { get; set; }

        [Index]
        public string State { get; set; }

        [Index]
        public string City { get; set; }

        [Version]
        public DateTime ModifiedTime { get; set; }
    }

Next we'll set our index root path, create an instance of IndexBuilder, register our address mapping, and start the index builder:

    // "simol" is an existing SimolClient instance
    simol.Config.Indexer.IndexRootPath = @"C:\MySimpleDBIndexes";
    ItemMapping addressMapping = ItemMapping.Create(typeof(Address));
                
    var builder = new IndexBuilder(simol)
    {
        UpdateInterval = TimeSpan.FromSeconds(1)
    };
    builder.Register(addressMapping);
    builder.Start();

As long as the indexer is running, any new items added to the Address domain will be added to our index. The delay between item storage and indexing depends on the value of IndexBuilder.UpdateInterval. We can find indexed items using the SimolClient.Find methods. Here's an example:

    Address address1 = new Address
    {
        Id = Guid.NewGuid(),
        City = "Atlanta",
        State = "Georgia",
        ModifiedTime = DateTime.UtcNow,
        StreetAddresses = new List<string> {"12345 Peachtree Road", "Suite 400"},
        Zipcode = "30301"
    };
    simol.Put(address1);

    // sleep while the indexer runs
    Thread.Sleep(TimeSpan.FromSeconds(5));

    List<Address> addresses = simol.Find<Address>(@"StreetAddresses: ""Peachtree""", 0, 1, null);

The SimolClient.Find method first searches the index of the Address domain using the specified full-text query. Any items found in the index are then retrieved from SimpleDB and returned. In the query above we're searching just the StreetAddresses property for the word "Peachtree".

The search query text is simply passed through to the installed IIndexer, which is an instance of LuceneIndexer by default. The query syntax is described in the Lucene query parser syntax documentation. See the SimolClient API documentation for more details on using Find methods to search the index.
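
Because the query text is handed to Lucene unchanged, it can help to build queries with small helpers rather than by hand. The following is a minimal sketch, not part of Simol: AddressQueries, Phrase, and And are hypothetical names, and it assumes that index field names match the property names, as in the StreetAddresses query above.

```csharp
using System;

public static class AddressQueries
{
    // Hypothetical helper: builds a Lucene phrase query scoped to one field,
    // for example City:"Atlanta". Backslashes and double quotes inside the
    // phrase are escaped with a backslash, as the Lucene query syntax requires.
    public static string Phrase(string field, string phrase)
    {
        string escaped = phrase.Replace("\\", "\\\\").Replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }

    // Hypothetical helper: combines two clauses so that both must match.
    public static string And(string left, string right)
    {
        return "(" + left + ") AND (" + right + ")";
    }

    public static void Main()
    {
        // prints StreetAddresses:"Peachtree Road"
        Console.WriteLine(Phrase("StreetAddresses", "Peachtree Road"));

        // prints (City:"Atlanta") AND (State:"Georgia")
        Console.WriteLine(And(Phrase("City", "Atlanta"), Phrase("State", "Georgia")));
    }
}
```

The resulting strings can be passed as the first argument of simol.Find<Address>(...) in the same way as the literal query in the example above.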

Architecture

Indexing and item updates are decoupled in Simol: items are not immediately indexed when you insert them in SimpleDB using SimolClient.Put. Instead, each indexed domain is independently "crawled" by the IndexBuilder to retrieve batches of new or updated items. The domain index is then updated with the current content of each item. This decoupling makes indexing slightly more expensive but more scalable, reliable, and maintainable.

The default indexer creates a separate index for each domain and indexes each property in its own Lucene index "field". Multi-valued attributes are concatenated into a single field for indexing. The state of each index is tracked using a Simol-controlled domain named "SimolSystem". If we examined the SimolSystem domain after running the address indexing example above we would find the following attributes:

    Attribute            Value
    Id                   f4e7d071-f460-4617-9d3e-2f0bf62973a6
    MachineGuid          c52f1263-c5f5-4c31-acf1-65b98b73d2b9
    DataType             IndexState
    DomainName           Address
    LastIndexedVersion   2010-02-12T15:48:24.739Z
    HostName             Indefatigable

These fields serve the following purposes:
  • MachineGuid - Unique identifier for the host running the Simol index builder. The value is derived from the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Cryptography\MachineGuid registry key, which will only change if the operating system is reinstalled.
  • DataType - Used by Simol to identify the type of system information stored in this item.
  • DomainName - The domain index tracked by this item.
  • LastIndexedVersion - The last item version indexed by the index builder.
  • HostName - The name of the host running the index builder.

Since the index state is tracked in SimpleDB you can freely stop or restart the index builder while your indexed domains are being updated, and those changes will be reflected in your indexes whenever the index builder next runs.
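
The crawl-and-resume behavior described in this section can be pictured with a small in-memory model. This is only a sketch and not Simol's implementation: StoredItem, CrawlSketch, and RunOnce are hypothetical names, a dictionary stands in for the Lucene index, and a list stands in for the SimpleDB domain.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for an item stored in SimpleDB (not a Simol type).
public class StoredItem
{
    public string Id;
    public string Text;
    public DateTime Version;   // plays the role of the [Version] property
}

// In-memory model of one domain index plus its IndexState record.
public class CrawlSketch
{
    // Plays the role of IndexState.LastIndexedVersion.
    public DateTime LastIndexedVersion = DateTime.MinValue;

    // Plays the role of the Lucene index: item id -> indexed text.
    public Dictionary<string, string> Index = new Dictionary<string, string>();

    // One crawl pass: index up to batchSize items changed since the last
    // pass, oldest first, then advance the stored position.
    // Simplification: items that share a timestamp across a batch boundary
    // would need extra care in a real crawler.
    public int RunOnce(IEnumerable<StoredItem> domain, int batchSize)
    {
        List<StoredItem> batch = domain
            .Where(item => item.Version > LastIndexedVersion)
            .OrderBy(item => item.Version)
            .Take(batchSize)
            .ToList();

        foreach (StoredItem item in batch)
        {
            Index[item.Id] = item.Text;          // add or overwrite the entry
            LastIndexedVersion = item.Version;   // remember how far we got
        }
        return batch.Count;
    }
}

public static class CrawlDemo
{
    public static void Main()
    {
        var t0 = new DateTime(2010, 2, 12, 15, 0, 0, DateTimeKind.Utc);
        var domain = new List<StoredItem>
        {
            new StoredItem { Id = "a", Text = "12345 Peachtree Road", Version = t0 },
            new StoredItem { Id = "b", Text = "1 Main Street", Version = t0.AddMinutes(1) }
        };

        var crawler = new CrawlSketch();
        Console.WriteLine(crawler.RunOnce(domain, 10));   // prints 2

        // An item is updated while the crawler is "stopped"...
        domain[0].Text = "12345 Peachtree Road, Suite 400";
        domain[0].Version = t0.AddMinutes(2);

        // ...and is picked up by the next pass because its version is newer.
        Console.WriteLine(crawler.RunOnce(domain, 10));   // prints 1
    }
}
```

The model also shows why the version property must be updated on every change: an item whose version does not move past the stored position is never selected again.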

See the LuceneIndexer API documentation and the Lucene.NET project for more implementation details.

Scaling and Multiple Servers

The simplest way to scale full-text indexing across multiple servers is to run the index builder on each Web host. This is somewhat inefficient, since every host indexes the same data, but it allows Simol itself to replicate your indexes across all servers with very little configuration or maintenance work. This also ensures that SimolClient.Find operations on all servers are running against fresh indexes.

Depending on your particular needs you may wish to consider the following alternative scaling strategies:
  • Run the index builder on a single server and maintain the indexes on a shared drive or storage array - The shared index files could be read for searching, but not updated, by other servers. This strategy avoids indexing the same data multiple times. However, storing the Lucene indexes on shared drives will degrade the performance of searching and indexing and may cause other maintenance issues.
  • Run the index builder on a single server and replicate the index files to other servers by copying the index files - This is probably the most efficient solution. However, the file copy process may interfere with searching on the servers being updated, so you would need to rotate each affected server out of the active pool while its indexes are being updated. Though efficient, this solution is more complex to implement, except where the indexed data changes infrequently or very stale indexes are acceptable (meaning the index replication is infrequent).

Before attempting to scale Simol indexing across multiple servers you should acquire a good understanding of Lucene.NET. You should consider implementing your own IIndexer or at least understand the LuceneIndexer well enough to adapt it to your needs.

Rebuilding Indexes

Rebuilding full-text indexes is fairly straightforward, but takes a few manual steps.

To force a complete rebuild, first delete the index files in question, then remove the IndexState record for the relevant server from the SimolSystem domain (see the Architecture section for details on this domain). The next time the index builder runs it will re-index all items in the domain.

To force a partial rebuild, simply edit the IndexState record for the relevant server and set the LastIndexedVersion back to the date and time where re-indexing should begin.
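
When editing LastIndexedVersion by hand, the new value should be written in the same form as the existing one. The sketch below formats a DateTime like the sample value shown in the Architecture section (2010-02-12T15:48:24.739Z). The format string is an assumption inferred from that sample, and IndexStateTime and ToVersionString are hypothetical names, not part of Simol.

```csharp
using System;
using System.Globalization;

public static class IndexStateTime
{
    // Assumed format, inferred from the sample LastIndexedVersion value:
    // UTC time with millisecond precision and a literal trailing "Z".
    private const string Format = "yyyy-MM-dd'T'HH:mm:ss.fff'Z'";

    // Hypothetical helper: converts a time to UTC and formats it.
    public static string ToVersionString(DateTime time)
    {
        return time.ToUniversalTime().ToString(Format, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        var restartFrom = new DateTime(2010, 2, 12, 15, 48, 24, 739, DateTimeKind.Utc);

        // prints 2010-02-12T15:48:24.739Z
        Console.WriteLine(ToVersionString(restartFrom));
    }
}
```

Compare the output with the value already stored in your own IndexState record before saving the change.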

Adding New Servers

To add a new server without re-indexing all your domains, copy the index files from an existing server and manually insert an IndexState item in the SimolSystem domain for the new server. It's very important to get the following two fields correct when creating the new IndexState item:
  • The LastIndexedVersion attribute must be set to a date and time at or slightly before the time the index files were copied or the new index will have a gap with some missing items.
  • The MachineGuid attribute must be set to the MachineGuid of the new server. Otherwise the IndexState record will not be found by the index builder and the entire index will be rebuilt anyway.

Deleting Items

Simol does not directly support deleting items from your full-text indexes. You are free to open the index files using the Lucene API and delete items yourself, but this may not be necessary.

If you delete an item from SimpleDB that remains in the full-text index, it will simply be ignored when invoking the SimolClient.Find methods. In other words, the Lucene index query may return the deleted item ids, but since the item no longer exists in SimpleDB it will never be returned by Simol. Unless you are adding and deleting indexed items frequently enough that the presence of "dead" items degrades performance, you may be able to ignore the issue of deletes altogether.

One way to ensure the content of deleted items is no longer in your full-text index is to first update the items after setting their indexed property values to null and their version property to the current time. The index builder will then update the full-text index, leaving the items in the index but removing their (now empty) fields. Once you've given the index builder enough time to re-index the pseudo-deleted items you can then remove the items entirely from SimpleDB with an asynchronous process that runs periodically.
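
The first half of this pseudo-delete can be sketched as a small helper. This is an illustration only: PseudoDelete and Clear are hypothetical names, and the Address class is repeated here without its Simol attributes so that the sketch compiles on its own.

```csharp
using System;
using System.Collections.Generic;

// Copy of the Address class from the walkthrough, minus the Simol attributes.
public class Address
{
    public Guid Id { get; set; }
    public List<string> StreetAddresses { get; set; }
    public string Zipcode { get; set; }
    public string State { get; set; }
    public string City { get; set; }
    public DateTime ModifiedTime { get; set; }
}

public static class PseudoDelete
{
    // Hypothetical helper: blanks every indexed property and bumps the
    // version property so the index builder sees the item as changed.
    public static void Clear(Address address, DateTime nowUtc)
    {
        address.StreetAddresses = null;
        address.Zipcode = null;
        address.State = null;
        address.City = null;
        address.ModifiedTime = nowUtc;
    }

    public static void Main()
    {
        var address = new Address
        {
            Id = Guid.NewGuid(),
            City = "Atlanta",
            State = "Georgia",
            Zipcode = "30301",
            StreetAddresses = new List<string> { "12345 Peachtree Road", "Suite 400" },
            ModifiedTime = new DateTime(2010, 2, 12, 15, 0, 0, DateTimeKind.Utc)
        };

        Clear(address, DateTime.UtcNow);

        // prints True
        Console.WriteLine(address.City == null && address.StreetAddresses == null);
    }
}
```

After calling Clear you would store the item with simol.Put, as in the walkthrough, and leave the final removal from SimpleDB to the periodic cleanup process.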

Last edited May 24, 2011 at 12:46 PM by ashleytate, version 26
