Accessor Configuration Files

Any accessor that extends AbstractDataAccessor may use an XML configuration file to initialize its settings. The file follows a particular schema. It defines the initialization settings, how the various source fields should be interpreted, the data templates for mapping source data into data objects, and how the various graph, grid, tree, and dataset structures should be composed.

There are several example configuration files in the config directory. The elements within the files are described below:

    <dataConfig>
        <init-param>
            <param-name>queryUrl</param-name>
            <param-value>http://mysolr.server.com/solr3/</param-value>
        </init-param>

The first elements of the file are initialization parameters. These follow the same style and syntax as the initialization parameters of a web.xml file. The name-value pairs are passed as a hashmap of Strings to the Accessor's setInitParameters() method.
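
As a separate illustration, not part of the example file being walked through, a configuration with two parameters might contain the elements below. The second parameter name, rowLimit, is hypothetical. It is shown only to illustrate that each init-param block contributes one name-value pair.

    <init-param>
        <param-name>queryUrl</param-name>
        <param-value>http://mysolr.server.com/solr3/</param-value>
    </init-param>
    <!-- Hypothetical second parameter, for illustration only -->
    <init-param>
        <param-name>rowLimit</param-name>
        <param-value>500</param-value>
    </init-param>

Under that assumption, setInitParameters() would receive two entries, and both values would arrive as Strings ("500" rather than the number 500), so the accessor would have to parse any numeric setting itself.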

    <dataTemplate name="employee" nameKey="Full Name">

The dataTemplate tag defines a particular data template. Each data template describes one way that the data source might represent data. The nameKey identifies which source field is used as the unique identifier for each record.

        <fieldDesc fieldName="Supervisor" fieldType="text" />
        <fieldDesc fieldName="Department" fieldType="enum"
            sourceField="Dept.">
            <values>
                <value>Accounting</value>
                <value>Software</value>
                <value>Marketing</value>
                <value>Operations</value>
                <value>Corporate</value>
            </values>
        </fieldDesc>
        <fieldDesc fieldName="Birth Place" fieldType="location" />
        <fieldDesc fieldName="Id" fieldType="text" sourceField="SSN" />
        <fieldDesc fieldName="Projects" fieldType="text"
            sourceField="Project1,Project2,Project3" multiValue="true"/>
        <fieldDesc fieldName="Birth Date" fieldType="time"
            sourceField="Birth_Date">
            <values>
                <format>yyyy-MM-dd'T'HH:mm:ss.SSS'Z'</format>
            </values>
        </fieldDesc>
        <fieldDesc fieldName="When Employed" fieldType="time"
            sourceStartField="Start_Date" sourceStopField="End_Date">
            <values>
                <format>yyyy-MM-dd</format>
            </values>
        </fieldDesc>
        <fieldDesc fieldName="Salary" fieldType="measure">
            <values>
                <type>currency</type>
            </values>
        </fieldDesc>
        <fieldDesc fieldName="Phone Extension" fieldType="int"
            sourceField="Extension">
            <transform>-x</transform>
        </fieldDesc>
    </dataTemplate>

The field descriptors define how the various fields are mapped into internal fields. The fieldName determines how the field will be referred to. For SemanticAccessors, this is the field name within the DataRecord. The sourceField determines which field within the source data is read for the field. If the sourceField is omitted, the accessor uses the fieldName as the sourceField. The sourceField can reference one or more source fields, separated by commas. The multiValue attribute indicates whether the field can include more than one value; if omitted, the field is a single-value field.

The fieldType determines what type of data the field represents. Valid values are enum, int, location, measure, text, and time. Each field descriptor may include an optional default tag, which defines the value to use if the field is not set. A field descriptor may also include a values tag, whose contents depend on the field type. In the examples in this section, an enum field lists its values, a time field gives its date format (and, for a duration, the duration units), and a measure field gives its measure type.

The descriptor may also include a transform tag, which names a transform to apply to the values read from the source field. In the example above, the Phone Extension field, which reads the Extension source field, has the transform -x, which removes any 'x' from the value. This would convert "x234" to "234", which could then be parsed as an integer.
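
The optional default tag is not shown in the example file. A minimal sketch of how it might be used follows. It assumes the element is named default, sits directly inside fieldDesc alongside values, and holds the fallback value as text. None of those details is confirmed by the example above.

    <!-- Sketch only: the name and placement of <default> are assumptions -->
    <fieldDesc fieldName="Department" fieldType="enum"
        sourceField="Dept.">
        <default>Corporate</default>
        <values>
            <value>Accounting</value>
            <value>Corporate</value>
        </values>
    </fieldDesc>

With a descriptor like this, a record whose Dept. source field is missing would presumably be given the Department value Corporate.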

The time field descriptor may refer to a sourceStartField and either a sourceStopField or a durationField instead of a sourceField. This is used when the time is a span rather than a single instant. These attributes specify which source fields are used to construct the span.

Note that the type for unstructured text is text, not string. Several other field types, notably enum and location, may also hold strings, so identifying unstructured text as text makes the expected use of the field explicit.

    <dataTemplate name="phoneCall" nameKey="callId">
        <fieldDesc fieldName="Call Time" fieldType="time"
            sourceStartField="Start" durationField="Call_Time">
            <values>
                <format>yyyy-MM-dd'T'HH:mm:ss.SSS'Z'</format>
                <durationUnits>seconds</durationUnits>
            </values>
        </fieldDesc>
        <fieldDesc fieldName="Sender" fieldType="text" />
        <fieldDesc fieldName="Receiver" fieldType="text" />
    </dataTemplate>

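This is a second dataTemplate in the same file, which describes phone calls. Its Call Time field is a time span built from a start field and a duration field, with the duration given in seconds. An accessor may have any number of templates.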

Below the templates are the definitions for the various data structures. Each of these definitions specifies the templates used to construct the structure. These template references may refer to multiple templates, separated by a space. Certain structures also include particular field references. For instance, the tree structure would need to identify which field of a node refers to its parent.

    <dataset name="workers" itemClass="employee" />

This defines a dataset called workers, which uses the employee data template.
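
Because a template reference may list several templates separated by a space, a dataset drawing on both templates could be sketched as follows. The dataset name allRecords is hypothetical, and the sketch assumes that itemClass is one of the attributes that accepts a space-separated list.

    <!-- Hypothetical dataset combining both templates -->
    <dataset name="allRecords" itemClass="employee phoneCall" />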

    <graph name="calls" nodeClass="employee" edgeClass="phoneCall"
        origField="Sender" destField="Receiver" directed="true" />

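This defines a directed graph of phone calls. The employee template is used for the nodes, and the phoneCall template is used for the edges. The origField and destField attributes name the edge fields, here Sender and Receiver, that identify the origin and destination of each call.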
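
For comparison, an undirected variant of the same graph might be written as below. The name contacts is hypothetical, and the sketch assumes that directed accepts the value false. The example file only shows the value true.

    <!-- Sketch: assumes directed="false" is accepted -->
    <graph name="contacts" nodeClass="employee" edgeClass="phoneCall"
        origField="Sender" destField="Receiver" directed="false" />

In such a graph, a call between two employees would link their nodes without regard to which one was the sender.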

    <tree name="orgChart" treeNodeClass="employee"
        parentField="Supervisor" />
    </dataConfig>

This last segment defines a tree representing the company organizational chart, in which the Supervisor field of each employee identifies its parent node. Each structure is retrieved by name through the get call for its type, so calling getGraph("calls") would return the graph of phone calls.

The configuration file may also include ontology and reasoning tags, placed after the last structure definition and before the closing dataConfig tag. These tags are described in the section on Ontologies.