Search index: Neo4j to Elasticsearch

Neo4j-to-elasticsearch is a Neo4j plugin that enables automatic synchronization between Neo4j and Elasticsearch. This means that all changes to Neo4j are automatically propagated to Elasticsearch.

The neo4j-to-elasticsearch plugin is not compatible with Neo4j v4.x.

Install neo4j-to-elasticsearch

Follow these steps to install the plugin:

  1. Download the GraphAware framework JAR
    • Choose a version A.B.C.x where A.B.C matches your Neo4j version and x is 44 or later
  2. Download the neo4j-to-elasticsearch JAR
    • Choose a version A.B.C.x.y where A.B.C matches your Neo4j version and x.y is 44.8 or later
  3. Copy graphaware-server-community-all-A.B.C.x.jar and graphaware-neo4j-to-elasticsearch-A.B.C.x.y.jar to your neo4j/plugins directory
  4. Add the following lines to the beginning of your Neo4j configuration file (neo4j/conf/neo4j.conf):
    com.graphaware.runtime.enabled=true
    com.graphaware.module.ES.1=com.graphaware.module.es.ElasticSearchModuleBootstrapper
    com.graphaware.module.ES.uri=HOST_OF_YOUR_ELASTICSEARCH_SERVER
    com.graphaware.module.ES.port=PORT_OF_YOUR_ELASTICSEARCH_SERVER
    com.graphaware.module.ES.mapping=AdvancedMapping
    com.graphaware.module.ES.keyProperty=ID()
    com.graphaware.module.ES.retryOnError=true
    com.graphaware.module.ES.asyncIndexation=true
    com.graphaware.module.ES.initializeUntil=2000000000000
     
    # Set "relationship" to "(false)" to disable relationship (edge) indexation. 
    # Disabling relationship indexation is recommended if you have a lot of relationships and don't need to search them. 
    com.graphaware.module.ES.relationship=(true)
     
    com.graphaware.runtime.stats.disabled=true
    com.graphaware.server.stats.disabled=true
  5. Restart Neo4j
  6. Once Neo4j has finished indexing the data, remove the following line (and only this line) from neo4j/conf/neo4j.conf:
    com.graphaware.module.ES.initializeUntil=2000000000000

Please note that indexation could fail if your data uses different data types for the same property key, for example if a property representing a date is stored as an ISO string on some nodes and as a timestamp on others. If you encounter this issue, please get in touch.
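One way to spot this problem before indexing is to group the value types found under each property key. The following is a minimal Python sketch and not part of the plugin. It assumes that node properties have already been exported as a list of dictionaries (how you export them is up to you); the function name find_mixed_type_keys and the sample data are hypothetical.

```python
from collections import defaultdict


def find_mixed_type_keys(property_maps):
    """Return {key: sorted type names} for keys that carry more than one value type."""
    types_by_key = defaultdict(set)
    for props in property_maps:
        for key, value in props.items():
            types_by_key[key].add(type(value).__name__)
    return {
        key: sorted(names)
        for key, names in types_by_key.items()
        if len(names) > 1
    }


# Hypothetical sample: 'created' is an ISO string on one node and a timestamp on another.
sample_nodes = [
    {"name": "Alice", "created": "2019-03-01T10:00:00Z"},
    {"name": "Bob", "created": 1551434400000},
]

mixed = find_mixed_type_keys(sample_nodes)
print(mixed)
```

Any key reported by this check is a candidate for normalization before the first indexation.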

Note regarding initializeUntil

The initializeUntil setting triggers the indexation of existing Neo4j data in Elasticsearch. com.graphaware.module.ES.initializeUntil must be set to a number slightly higher than the value that a Java call to System.currentTimeMillis() would return when the module is started. That way, the database is (re-)indexed only once, and not with every subsequent restart. The value used in the installation steps, 2000000000000, lies far in the future, which is why the line must be removed once the initial indexation has finished.

In other words, re-indexing will happen if System.currentTimeMillis() < com.graphaware.module.ES.initializeUntil.
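The condition above can be sketched in a few lines of Python. This is an illustration of the rule, not the plugin's code: the helper names and the five-minute margin are assumptions chosen for the example.

```python
import time


def current_time_millis():
    """Milliseconds since the Unix epoch, comparable to Java's System.currentTimeMillis()."""
    return int(time.time() * 1000)


def suggest_initialize_until(now_ms, margin_ms=5 * 60 * 1000):
    """Return a value slightly later than now_ms (the 5-minute margin is an assumption)."""
    return now_ms + margin_ms


def will_reindex(now_ms, initialize_until):
    """Mirror the stated rule: re-index only while now < initializeUntil."""
    return now_ms < initialize_until


now = current_time_millis()
value = suggest_initialize_until(now)
print(f"com.graphaware.module.ES.initializeUntil={value}")
```

With a value computed this way, a restart within the margin triggers indexation, while a restart after the margin has elapsed does not.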

Integrate with Linkurious Enterprise

Once the neo4j-to-elasticsearch plugin is installed, you need to change the relevant data-source configuration to use neo2es as its search index vendor.

You can either use the Web user-interface or edit the configuration file located at linkurious/data/config/production.json to set the index.vendor property to the value neo2es.

Troubleshooting and Optimization

For smaller indexation tasks, neo4j-to-elasticsearch can be used straight out of the box. Larger indexes are trickier. If you find that indexation is failing to complete, or that search is unusually slow, you will need to settle for partial indexation in order to keep Elasticsearch usable.

Partial indexation can be configured in your Neo4j configuration file (neo4j/conf/neo4j.conf) by specifying a subset of your graph to index. You can do this by selecting which types of nodes and relationships to keep and which properties to index on these nodes.

(For a detailed list of configuration options, please consult the official neo4j-to-elasticsearch documentation, available in GraphAware's GitHub repository. This guide focuses on a few use cases that should be adaptable to a wide variety of graph models.)

Configuration Options

Partial indexation is handled by four options which can be added to your Neo4j configuration file:

com.graphaware.module.ES.node=...
com.graphaware.module.ES.node.property=...
com.graphaware.module.ES.relationship=...
com.graphaware.module.ES.relationship.property=...

com.graphaware.module.ES.node and com.graphaware.module.ES.relationship control which nodes and relationships to index. com.graphaware.module.ES.node.property and com.graphaware.module.ES.relationship.property control which properties of these nodes and relationships to index.

Each of these options takes one or more parameters as its value. Parameters are boolean expressions that can be chained together using the standard logical operators && (AND) and || (OR). The most basic parameters are (true) and (false), which tell the plugin either to index everything (the default behavior) or to index nothing. Note that the parentheses around true and false are required.

To reiterate, if we add the line

com.graphaware.module.ES.relationship=(false)

to our configuration file, Elasticsearch will not index any of the relationships in our database.

The next step up is parameter functions. These allow for more complex inclusion and exclusion rules. For both nodes and relationships, the following two functions are available:

  • getProperty('propName', 'defaultValue'): Returns a property value, allowing you to compare it to another value using standard comparison operators.
  • hasProperty('propName'): Returns true or false depending on whether a node or relationship contains a certain property.

Additionally, there are functions specific to nodes and to relationships. For nodes, those that are useful for partial indexation are:

  • getDegree(): Returns the degree (number of relationships) of a node. Can be used with comparison operators.
  • hasLabel('labelName'): Returns true or false depending on whether the specified label is present on a node.

And for relationships, they are:

  • isType('typeName'): Returns true or false depending on whether the relationship has the specified type.
  • isOutgoing(): Returns true or false depending on whether a relationship is outgoing.
  • isIncoming(): Returns true or false depending on whether a relationship is incoming.

(You can find a full list of functions in the inclusion policies documentation maintained in GraphAware's GitHub repository.)

Example

Say we have a due diligence database containing people, banks, account numbers, addresses, telephone numbers, and email addresses. It's a large database -- several hundred million nodes -- so fully indexing the graph would be a lengthy process, and may ultimately be unnecessary if we are only interested in full-text search on a restricted set of node and relationship properties.

The first question to ask is which data we are actually interested in. Maybe as part of our hypothesis about the data, we want to focus on a certain subset of connections that we believe form patterns of interest.

Let's say that we want to focus on identifying information only -- we think that there are cases where this information is shared by multiple individuals, for instance, and we're interested in analyzing them. The nodes of interest to us will therefore be those nodes which represent pure identifiers and not entities themselves -- addresses, telephone numbers, account numbers, and email addresses. We can tell Elasticsearch to index these and only these by adding the following line to our Neo4j configuration file:

com.graphaware.module.ES.node=hasLabel('Address') || hasLabel('Telephone') || hasLabel('Email') || hasLabel('Account')

(com.graphaware.module.ES.node=!hasLabel('Person') && !hasLabel('Bank') will also work.)
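The two forms are equivalent only under the assumption made in this example, namely that every node carries exactly one of the six labels. The Python sketch below models both expressions with plain functions (it does not call the plugin; the function names and the extra label Company are hypothetical) and shows where they diverge.

```python
def whitelist(labels):
    # Model of: hasLabel('Address') || hasLabel('Telephone') || hasLabel('Email') || hasLabel('Account')
    return any(label in labels for label in ("Address", "Telephone", "Email", "Account"))


def blacklist(labels):
    # Model of: !hasLabel('Person') && !hasLabel('Bank')
    return "Person" not in labels and "Bank" not in labels


ALL_LABELS = ["Person", "Bank", "Account", "Address", "Telephone", "Email"]

# When every node carries exactly one of the six labels, both policies agree.
single_label_agree = all(whitelist({label}) == blacklist({label}) for label in ALL_LABELS)
print(single_label_agree)

# Hypothetical edge cases where the two policies differ.
multi_label = {"Person", "Address"}  # included by the whitelist, excluded by the blacklist
unknown_label = {"Company"}          # excluded by the whitelist, included by the blacklist
print(whitelist(multi_label), blacklist(multi_label))
print(whitelist(unknown_label), blacklist(unknown_label))
```

If your graph contains multi-label nodes or labels that may be added later, the explicit list of labels to include is the more predictable choice.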

And since we aren't including people or banks, we also want to focus on the relationships relevant to our nodes of interest:

com.graphaware.module.ES.relationship=isType('HAS_ADDRESS') || isType('HAS_PHONE') || isType('HAS_EMAIL') || isType('HAS_ACCOUNT')
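For long lists of labels or relationship types, lines like the two above can be generated rather than typed by hand. The helper below is a hypothetical convenience written in Python, not something shipped with the plugin; it simply joins function calls with the || operator and does not escape quotes inside names.

```python
def any_of(function_name, names):
    """Chain function_name('x') calls with ||, e.g. hasLabel('A') || hasLabel('B')."""
    return " || ".join(f"{function_name}('{name}')" for name in names)


node_line = "com.graphaware.module.ES.node=" + any_of(
    "hasLabel", ["Address", "Telephone", "Email", "Account"]
)
relationship_line = "com.graphaware.module.ES.relationship=" + any_of(
    "isType", ["HAS_ADDRESS", "HAS_PHONE", "HAS_EMAIL", "HAS_ACCOUNT"]
)

print(node_line)
print(relationship_line)
```

The printed lines can be pasted into neo4j/conf/neo4j.conf as they are.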

If we want to be even more specific, we can select only those properties which are relevant to our inquiry by adding com.graphaware.module.ES.node.property to our config. We follow this with a list of comparisons against key, the node property name, separated by || (OR) operators, one for each property that we want to include in our index:

com.graphaware.module.ES.node.property=key == 'address1' || key == 'city' || key == 'state' || key == 'number' || key == 'email' || key == 'accountNumber'
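The effect of this filter on a single node can be modeled as follows. This is a simplified Python sketch of the behavior described here, not the plugin's implementation; the sample node and its zip and verified properties are hypothetical.

```python
# Keys listed in the com.graphaware.module.ES.node.property line above.
INDEXED_KEYS = {"address1", "city", "state", "number", "email", "accountNumber"}


def indexed_properties(properties):
    """Keep only the properties whose key passes the filter (key == 'a' || key == 'b' ...)."""
    return {key: value for key, value in properties.items() if key in INDEXED_KEYS}


# Hypothetical Address node: 'zip' and 'verified' are not part of the filter.
address = {
    "address1": "1 Main St",
    "city": "Springfield",
    "state": "IL",
    "zip": "62701",
    "verified": True,
}

kept = indexed_properties(address)
print(kept)
```

Only address1, city, and state survive for this node; the other two properties are never sent to the index.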

And if we only want to index nodes and NOT relationships, we can disable relationship indexation completely:

com.graphaware.module.ES.relationship=(false)

Since relationships are often more numerous than nodes, excluding them from our index can significantly reduce its storage footprint.

What if we finish our initial analysis and conclude that we need to know more about the financial institutions in our graph? We want to add banks to our index, but we don't need to search everything that is stored on them. Furthermore, we're only interested in banks in a certain region -- Europe, say. We can add the right banks back by modifying our original directive to read:

com.graphaware.module.ES.node=hasLabel('Address') || hasLabel('Telephone') || hasLabel('Email') || hasLabel('Account') || (hasLabel('Bank') && getProperty('bankRegion', 'None') == 'Europe')
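To check how this expression treats different nodes, it can be modeled in Python. The sketch assumes that getProperty returns the default value when the property is missing, as its signature suggests; the function names and sample nodes are hypothetical.

```python
def get_property(properties, name, default):
    """Model of getProperty('propName', 'defaultValue')."""
    return properties.get(name, default)


def include_node(labels, properties):
    """Model of the node expression: identifier labels, or banks located in Europe."""
    identifier_labels = {"Address", "Telephone", "Email", "Account"}
    if identifier_labels & set(labels):
        return True
    return "Bank" in labels and get_property(properties, "bankRegion", "None") == "Europe"


print(include_node(["Bank"], {"bankRegion": "Europe"}))  # European bank
print(include_node(["Bank"], {"bankRegion": "Asia"}))    # bank in another region
print(include_node(["Bank"], {}))                        # bank without a region
print(include_node(["Person"], {"bankRegion": "Europe"}))
```

The default value 'None' matters for banks that have no bankRegion property: the comparison with 'Europe' then fails and the node is left out of the index.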

And we can add a few bank properties like this:

com.graphaware.module.ES.node.property=key == 'address1' || key == 'city' || key == 'state' || key == 'number' || key == 'email' || key == 'accountNumber' || key == 'bankName' || key == 'bankIdentifier' || key == 'bankRegion'

Keep in mind that these keys will be indexed for every node on which they appear, whatever its label. If banks also have a state property, for example, it will be indexed together with the state property of addresses. It's worth remembering this when constructing your data model: namespace collisions like this one can prove computationally costly.
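A quick way to find such collisions is to list, for each indexed key, the labels on which it occurs. The Python sketch below assumes the nodes have been exported as (labels, properties) pairs; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def keys_shared_across_labels(nodes, indexed_keys):
    """Map each indexed key to the labels it appears on, keeping keys found on 2+ labels."""
    labels_by_key = defaultdict(set)
    for labels, properties in nodes:
        for key in properties:
            if key in indexed_keys:
                labels_by_key[key].update(labels)
    return {
        key: sorted(labels)
        for key, labels in labels_by_key.items()
        if len(labels) > 1
    }


# Hypothetical export: both addresses and banks carry a 'state' property.
nodes = [
    (["Address"], {"address1": "1 Main St", "state": "IL"}),
    (["Bank"], {"bankName": "Example Bank", "state": "NY"}),
]

shared = keys_shared_across_labels(nodes, {"state", "address1", "bankName"})
print(shared)
```

Keys reported here are indexed for every label listed next to them, so renaming one of them in the data model keeps the index limited to the intended nodes.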

Summary

Neo4j-to-Elasticsearch, in combination with partial indexation strategies, can be a very efficient way of handling index synchronization for large graphs. Even if your graph is small enough to index in full, it's worth considering to what extent it may grow. By examining your problem space and choosing only a subset of the information available to you for your traversal needs, you may save yourself future headaches and maximize the efficiency of your graph.