Saturday, December 17, 2011

Using Neo4J to visualize SAP PI interface data.

Neo4j is a high-performance, NOSQL graph database with all the features of a mature and robust database. A graph database is ideally suited for highly connected data. Marko Rodrigues compares the performance of MySQL and Neo4J for a graph query with 1,2,3,4,5 connections.  The information model for a typical SAP PI setup is highly connected. Lets look through the concepts that I want to visualize and explore.

SAP PI system,
Party, Service (or Component), Interface, Interface Map, Message Map,
Communication Channel and   runtime information of all messages (approx 80K per day = 28 M per year).

It should be fast and let me navigate relationships between objects.  I would like to be able to make a query like
Show me all messages between 3:05am and 3:55am with all the relevant information about parties etc.
Show me summary of all messages between 3:05am and 3:55am with all the relevant information about parties etc.
Show me which interface flows (sender/receiver) are effected if a communication channel is down.

Some notes on implementing Neo4J
To setup Neo4J, download the Community version from http://neo4j.org .  After you get the Neo4J server running, launch the browser at http://localhost:7474. Alternatively, you can launch the neo4j-shell command to look around. These are the shell commands that got me started.

cd
mkrel --cd -c -d o -t system -v --np "{'name':'PXP'}"
mkrel --cd -c -d o -t party -v --np "{'name':'PCE_LNSFA','type':'XIParty', 'partyagency':'http://sap.com/xi/XI','partytype':'XIParty'}"

The first line takes you to the reference node which is created by default in the starting database. 
The second line creates an outbound relationship called "system" and a new node at the end of the relationship. The new node has the property "name" = "PXP". The "name" property is used by the shell to give a friendly name to the node id. The "--cd" option moves you to the newly created node.
The third line creates an outbound relationship called "party" and a new node at the end of the relationship. The new node has the property "name" = "PCE_LNSFA" and a few more properties. So lets take a look at our graph.
cd
trav

The first line takes you to the starting node and the second line will traverse the entire graph for you.

mkrel --cd -c -d o -t interface -v --np "{'name':'IOA_ChangeCRMTaskRequest','type':'XIInterface', 'namespace':'http://philips.com/pce/lotesnotes/sales/tasks','addressid':'4ACB55C22850085AE1008002828BD49C'}"
index -i interface addressid

The first line here creates outbound relationship called "interface" and a new node at the end of the relationship. The new node has the property "name" = "IOA_ChangeCRMTaskRequest". 
The second line adds the newly created node to an index called "interface", with the property addressid stored in the index. This makes it easy to look up the node later using the following
index --cd interface addressid "4ACB55C22850085AE1008002828BD49C"
This will find the interface node and take you directly to that node. If we had indexed the first two nodes like this
index -i myindex system
index -i myindex xitype
we could have searched using
index --ls -q myindex "system:PXP AND xitype:service"

Once you get used to the basics, you can try your hand at advanced queries (or traversals in case of a graph). Take a look at the syntax of Cypher query language to understand this.
start n = (0) 
match (n)-[:systems]->(sys)-[:party]->(parties)-[:service]->(service) 
where sys.name = 'PXP' 
return n,sys,parties,service

The variables are specified as (varname). Basic syntax is
start 
match
where
return

You can use aggregate functions as well.
start n = (0) match (n)-[^1..4]-()-[:xi_interface]-(i) return i
start n = (0) match (n)--(p)-[^1..4]-()-[:xi_interface]-(i) return n,p, count(*)
The first line will display all nodes that are linked 1-5 relationships away from the reference node and the last link is xi_interface.
The second line, sums this up - grouping by the (p) - i.e. the system. So it displays number of interfaces per system.



start n=(message,'firstday:20110930 AND firstts:[20110930000000 TO 20110930005900]') match (n)-[:sends]-(s) where s.system='PXP' return s.party,count(*)
start n=(message,'procmode:S') match (n)-[:sends]-(s) where s.system='PXP' return s.party,s.service,s.interface,count(*)
start n=(message,'procmode:S') match (r)-[:receives]-(n)-[:sends]-(s) where s.system='PP4' return r.party,r.count(*),avg(n.latency),avg(n.msgsize)
start n=(message,'procmode:S') match (r)-[:receives]-(n)-[:sends]-(s) where s.system='PXP' return r.party,r.service,r.interface,s.party,s.service,s.interface,n.firstts,n.lastts,n.latency,n.dbentry,n.maprequ,n.msgsize,n.msgid
start n = (0) match (n)--(sys)--()--()--(iface)--(message) where message.latency > 2000 return iface,count(message)




Thoughts after the experiment
After a certain number of records, the database slows down. I tried this with a database of 22GB. Some queries did not return after a long time. So you may want to split into multiple databases. The web interface is fine to visualize limited graph nodes - but does not scale up. Use the command line when you have too many nodes.