OSGi at the UK's biggest science lab

As a Java developer, you undoubtedly know about the goodness of OSGi and breaking up your class loading into modules. After all, OSGi is the dynamic module system for Java, right? You might have played around with declarative services, or perhaps you are waiting for Jigsaw. Java these days is a very mature technology stack, and even though the barrier to adopting OSGi is low (and I mean really low), plenty of products have yet to migrate to dynamic class loading. This is especially true if the product is large, mature, and not slated for a major refactor. At three million lines of Java across server and thick-client code, our product at Diamond Light Source fits exactly that description. Nonetheless, we recently moved our code base to OSGi. In this article I’ll explain why we made the change and describe seven real-world challenges we encountered and how we resolved them.

Java technology, applied to science

Diamond Light’s synchrotron works like a giant microscope, harnessing the power of electrons to produce bright light that scientists can use to study anything from fossils to jet engines to viruses and vaccines. The United Kingdom’s largest science project and one of the world’s most advanced facilities, the synchrotron is used by over 10,000 scientists to run experiments.

To produce the high energy light that scientists need to conduct their research, engineers at Diamond Light accelerate electrons, then move them around using magnetic fields. The light comes out of a circular machine, which is the starting point for a huge range of experimental techniques. Where it exits, the light goes through an optics hutch. Individual experiments are run using our Java-based acquisition system.


Figure 1. Beamlines radiating from the Diamond Light Source synchrotron. Image credit: Diamond Light Source.

The linear experimental parts of the facility, radiating from the circular synchrotron, are called beamlines. Currently, 33 beamlines are either in operation, in construction, or being designed. All of them have, or will require, a Java server and client able to coordinate experiments and serve as an interface for visiting scientists to control the synchrotron. The software must be able to move motors remotely (there are usually x-rays in the experimental hutch of the beamline); trigger detectors (imagine something like a digital camera); and write large binary files of data, often at high rates. Some beamlines require detectors able to write many megabytes of data at a kilohertz rate.

Modernizing a legacy system

Part of Diamond Light’s Java software stack was inherited from the Daresbury Laboratory synchrotron, the Synchrotron Radiation Source (SRS). The SRS was closed in 2008, but some of its software lived on in Diamond Light’s acquisition and analysis systems. This has been very useful because, as we know, algorithms never die (although they may mutate). While some of the major features and ideas for working with the software came to us from the SRS, a developer today might choose to do some things differently. For instance, the legacy system’s client and server used CORBA to communicate. The server had a large classpath with many interconnected dependencies, and the client was a thick client based on Swing. The client had a neat contributed design, however, which allowed custom experimental parts to be mixed with general-purpose ones; that was a capability we didn’t want to lose.

For our first foray into OSGi, we chose to migrate and rewrite part of the client. We moved from Swing to SWT/JFace using the Rich Client Platform (RCP), which is available from the good people at the Eclipse Foundation. The move led us to adopt an Equinox classloader for our client. Dynamic class loading was not the reason we migrated the client; it was something that came with the new platform. We used it first, not to make the server modular, but to make the client start faster. It worked well for that purpose, so there was no real reason to modify the server architecture. For the next five years, we didn’t.

The creeping cost of support

So what changed? Well, like a lot of real-world projects, the proportion of maintenance work developers were doing started going up. This was especially costly when compared with time spent writing the software that new beamlines needed. In many cases, maintenance became almost all a developer was doing. These days we run a variation on a DevOps shop, so developers are usually involved in supporting systems as part of their work. This is the correct approach for us, but if developers aren’t also innovating and creating new software, we know that something has gone wrong.

A lot of what we do at Diamond Light requires creative input from developers to get the new science available to our users. But over time, we built up technical debt. Some signs of our technical debt included:

  • Using different APIs that do the same or similar things
  • Making overly interconnected projects and classes
  • Improper encapsulation of functionality
  • Rarely writing adequate unit tests

Another thing we did was run from source. Yes, you read that correctly: we manually pulled the software out of its repository and built it specially for each experiment, leaving the source code and compiled bytecode as a bespoke version for each given beamline. We had all the usual stack of integration tools (an automated build for each beamline in Jenkins, JUnit tests, Squish for the user interface, and so on), but ultimately a developer was pulling a custom product out of the repositories, changing certain files by hand, and leaving this version for the next run of the machine.

The system wasn’t efficient or reproducible, which meant it had to change. After reading online articles, learning at industry conferences, and taking input from new colleagues, we came up with a plan to move our server to something closer to industry standards. The path, however, was not entirely smooth.

Real world problem #1: Integration

The first thing we decided to do was make a single-server product for data acquisition: one that could be used on any beamline and was shipped as a binary from a reproducible build. OSGi was a perfect fit for this project: bundles are loaded dynamically, after all, and one of the main reasons for dynamic class loading is that the size of the binary product can grow beyond what will fit in memory. Using OSGi meant that beamline-specific bundles (for example, those dealing with certain detectors, or specific libraries for decoding streams) could be built into the single product. Only if they were used on an experiment would they be class-loaded and take up space in the virtual machine (VM).

So far so good, but we had lost one strong advantage: the original “running from source” approach to code integration, which had allowed us to change and debug beamline code on the fly. We needed this integration capability because our developers are often required to deliver complex and variable requirements at the last minute, such as integrating a laser into the data acquisition timing. Fortunately, it’s possible to insert code into a running OSGi VM using various tools: we looked seriously at Hotswap Agent and JRebel. After some deliberation, we chose JRebel because it integrated easily with our server. Our OSGi-based system requires that developers commit and build/test code into the single product before it is left as a product on a beamline, but using JRebel gives us the flexibility to develop code (temporarily) on the live system.

Real world problem #2: Multiple configurations

We were already using Spring as an instantiation layer. For us it wires together a beamline’s configuration, building things like connections to motors, detectors, and online data analysis. We chose to keep our Spring configurations unchanged and run them with the single server, so all the existing classes in disparate Java projects had to work. When the server starts, the OSGi container loads the main classes, after which the Spring configuration is run. In some cases, Spring could require a class that the OSGi container had yet to load. Solutions like Blueprint with Apache Aries and Spring DM are often well suited to such scenarios. In our case, because we’re using Equinox, we decided to use Eclipse buddy class loading, which is less elegant but works. Two things were essential about the config:

  • When making the bundle with the Spring JAR files in it, for instance called org.acme.spring.bundle, the manifest should contain the header Eclipse-BuddyPolicy: registered.

  • The bundle containing the classes to be loaded by Spring must contain the header Eclipse-RegisterBuddy: org.acme.spring.bundle in its manifest file. This allows the Spring bundle to look those classes up.
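As a rough sketch (the second bundle name here is hypothetical), the manifest of the Spring bundle would contain:

    Bundle-SymbolicName: org.acme.spring.bundle
    Eclipse-BuddyPolicy: registered

while a bundle whose classes Spring must instantiate registers itself as a buddy:

    Bundle-SymbolicName: org.acme.beamline.devices
    Eclipse-RegisterBuddy: org.acme.spring.bundle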

This approach is Equinox-specific rather than a standard OSGi feature. However, because it is only a manifest entry, it should be inexpensive to change our Spring and OSGi integration to something more standard later.

Real world problem #3: Migrating to bundles

OSGi bundles at their best are another layer of encapsulation above the class level with which all developers are familiar. From where we were, though, moving to a culture of bundles with minimal and well-understood dependencies was a different matter. Developers were used to certain areas of the code that “glue together” the product and depend on many things; we called this the core [cue kettle drums]. Later, I will discuss how we use declarative services, but one way we’ve been able to make services work is to define commonly used interfaces and beans in no-dependency bundles. (And in this case that no definitely means no.) We then use declarative services to provide the implementations without compile-time dependencies, so a core is not required. Instead, we have bundles that use and do things, and bundles that provide those things.
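Here is a minimal sketch of the pattern; the interface and class names are invented for illustration rather than taken from our actual API:

    // Lives in a no-dependency bundle, e.g. org.acme.scanning.api
    public interface IFileWriterService {
        void write(String filePath, double[] data) throws Exception;
    }

    // Lives in a separate implementation bundle and is registered as a
    // declarative service via an OSGI-INF/*.xml component definition,
    // so consumers depend only on the no-dependency API bundle.
    public class NexusFileWriterService implements IFileWriterService {
        @Override
        public void write(String filePath, double[] data) throws Exception {
            // ... write the data using whichever library this bundle wraps
        }
    }

A consumer is handed an IFileWriterService by the service framework and never needs a dependency on the implementation bundle.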

Getting developers working in this way requires a culture change. While the shift is ongoing, many have embraced the idea. We chose not to remove the core bundles or refactor them directly in one go. Rather than fighting n-sided developer battles, we decided to use the right design in new work and let the ideas spread organically, through training and by sticking to good practice in new bundles; the refactoring can be done later. Our problem with core bundles does not have to be solved right away, but we are chipping away at it using the no-dependency bundles.

Real world problem #4: The static, non-modular algorithm

Today our server has around a hundred OSGi declarative services for things like loading files, getting interfaces to hardware, writing data to a fast distributed file system (we use GPFS and Lustre), talking to FPGA-based devices via a description language, sending text messages to a port on a custom Linux device that controls a detector, and much more besides! In fact, depending on the experiment, the various bridging bundles and device libraries can easily outnumber the scanning algorithm itself.

The scan algorithm is the heart of the data acquisition system. It is one of the parts that brings together separate concepts like devices and file writing, running them in concert to collect useful data for the user. On the face of it, there wasn’t much wrong with our existing scanning system. Having been honed by several generations of developers (using the standard developer lifetime of seven years), it was pretty fast, robust, and had a useful Jython layer with which to extend it.

The scan did have a problem, though: it could not deal with a new file-writing design, which was introduced in a separate project and had to be integrated into our software. The existing scan stored data statically and was written in a non-modular way, which made it expensive to adapt. We decided to solve the file-writing requirement and at the same time migrate scanning to OSGi. So scanning was one of the few parts of the system, and the most important part, that we did choose to rewrite. The final algorithm spans a few thousand lines of code, and its bundle is shown expanded below.


Figure 2. The OSGi bundle for the scan algorithm. Image credit: Matthew Gerring.

The main algorithm of the scan is an iteration over n-dimensional motor positions running objects that manage fork/join pools. I’ve printed part of it here, and it’s also available under an open source license on GitHub.

Listing 1. The main algorithm for Diamond Light's scanning service


for (IPosition pos : moderator.getOuterIterable()) {

    // Check if we are paused; blocks until we are not
    boolean continueRunning = checkPaused();
    if (!continueRunning) return; // finally block performed

    // Run to the position
    manager.invoke(PointStart.class, pos);
    positioner.setPosition(pos);           // moveTo in GDA8

    writers.await();                       // Wait for the previous write out to return, if any

    nexusScanFileManager.flushNexusFile(); // Flush the nexus file
    runners.run(pos);                      // GDA8: collectData() / GDA9: run() for Malcolm
    writers.run(pos, false);               // Do not block on the readout; move to the next position immediately

    // Send an event about where we are in the scan
    manager.invoke(PointEnd.class, pos);
    positionComplete(pos, count, size);
    ++count;
}

Devices need many types of hooks, and three are shown in Listing 1. Note that the object called manager is invoking annotations on the devices taking part in the scan. Annotations give a simple, low-dependency way of creating a device that can respond to the many different parts of a scan. Previously we had used inheritance for this feature, but that became less manageable as the inheritance tree grew huge (over 10 levels deep, depending on the device).
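For illustration, a device taking part in a scan might hook into these points roughly as follows; the annotation names match Listing 1, but the device class and method signatures are assumptions:

    // Annotations and IPosition come from the scanning API bundles
    public class ExampleTriggerDevice {

        @PointStart
        public void prepare(IPosition pos) throws Exception {
            // Invoked by the manager as the scan reaches each new position
        }

        @PointEnd
        public void positionDone(IPosition pos) throws Exception {
            // Invoked after the runners and writers have been started for pos
        }
    }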

After upgrading the scanning with our new fast file writing, the power of annotations, and the use of fork/join pools, we discovered an unexpected outcome: we had made our scanning about 10 times slower in the benchmark test. In this test we ran old scans and new scans from the Jython layer, timing the results. On the upside, the new system scaled to millions of points, whereas the original system had started to get slow above tens of thousands of points and would grind to a halt at hundreds of thousands.

After a bit of head scratching and timing various parts of the system (using nothing more complex than test classes), we discovered that it was start-up that was taking much longer: all the dynamic class loading OSGi does at the start of the scan was being included in the timing, and it dominated the benchmark. After changing the test to set up the bundles separately (in a @BeforeClass method), we were left with a system about one millisecond slower per point for a non-hardware-accelerated scan, and much more scalable.
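In JUnit terms, the fix amounted to something like the following sketch; the helper methods and the timing threshold here are invented for illustration:

    import static org.junit.Assert.assertTrue;

    import org.junit.BeforeClass;
    import org.junit.Test;

    public class ScanBenchmarkTest {

        @BeforeClass
        public static void loadBundles() throws Exception {
            // Touch the scanning services once, up front, so that OSGi's
            // dynamic class loading is not counted in the timing below.
            createScanService();
        }

        @Test
        public void benchmarkGridScan() throws Exception {
            long start = System.currentTimeMillis();
            runGridScan(1000); // run a 1,000-point scan
            long elapsed = System.currentTimeMillis() - start;
            assertTrue("Scan too slow: " + elapsed + "ms", elapsed < 2000);
        }

        // Hypothetical helpers standing in for the real service set-up and scan
        private static void createScanService() { /* resolve scanning services */ }
        private static void runGridScan(int points) { /* run the scan */ }
    }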

Lesson learned.

Real world problem #5: Cardinality

When we decide to consume or provide a service, we declare a little XML file and point the manifest at it. Lars Vogel’s excellent blog on the subject from eight years ago gives the gory details. You have to do some important things to get this to work properly:

  1. MANIFEST.MF
    1. Bundle-ActivationPolicy: lazy
    2. Service-Component: OSGI-INF/*.xml (or wherever your XML files are)
  2. Do little or ideally no work in the constructor of your service; then it will play nicely with the other services it consumes.
  3. Make sure the new XML files are built by setting them in build.properties (see the sketch after this list).
  4. Use the correct cardinality in your service XML files.
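For an Eclipse/PDE bundle, the third point typically means a build.properties along these lines (a generic sketch, not one of our actual files):

    source.. = src/
    output.. = bin/
    bin.includes = META-INF/,\
                   OSGI-INF/,\
                   .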

In mathematics, cardinality refers to the number of elements in a set or other grouping, as a property of that grouping. In OSGi, a service reference can have one of the following cardinalities:

  • 0..1 (zero or one service instance)
  • 0..n (zero to n)
  • 1..1 (exactly one; if it is not available, things start to fail)
  • 1..n (one or more)

Here’s an example of us injecting some services into a class and setting the cardinality:


Figure 3. Injecting services and setting cardinality. Image credit: Matthew Gerring.
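For reference, the XML behind such an injection looks roughly like this; the component and reference names echo the warnings shown later in Listing 2, but the file as a whole is a reconstruction (the implementation class name is hypothetical):

    <scr:component xmlns:scr="http://www.osgi.org/xmlns/scr/v1.1.0"
                   name="Data Slice Service Holder">
        <implementation class="org.acme.DataSliceServiceHolder"/>
        <reference name="IImageService"
                   interface="org.eclipse.dawnsci.plotting.api.histogram.IImageService"
                   cardinality="0..1"
                   policy="dynamic"
                   bind="setImageService"/>
    </scr:component>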

We started off giving most of the files where services are consumed a 1..1 cardinality, and this worked well for a while. It made sense: we had one instance of each service and required it.

In practice, however, a service sometimes could not resolve; perhaps it had a missing dependency, or something it relied on bombed out. When that happens and all the references in a class are 1..1, none of the services in that class resolve. This leads to errors later on that are unrelated to the service that actually had problems. Therefore, we switched to mostly using 0..1 cardinality in classes with many services injected, relying on a NullPointerException to point developers at the specific service that failed. In other cases we have services injected directly into the class that uses them. The class must then have a no-argument constructor that does very little work, because at OSGi injection time the whole system might not yet be up and working.

Real world problem #6: Declarative services

I wrote in Real world problem #4 that our product today has lots of services. It has been surprising how fast the idea of a no-dependency interface implemented by bundles elsewhere has caught on for us. Not every developer in the group is familiar with declarative services yet, however; rather, several pioneers have taken up the cause in different teams.

Going back a few years, before OSGi, we were prospecting: looking for a better way of doing things. Now different members of the group have taken the idea forward. They have enthusiastically hidden dependencies and created new services, and this has presented its own problem. Although we have one server, different developers provide bundles without knowing the context in which they will be run on other experiments. Not only that, but we also have test products and share bundles with other products; for example, we have a product we use for analysis called DAWN. We also have open source projects that we have donated or will be donating to the Eclipse Foundation, like Scanning, January, and Richbeans. This means that OSGi XML that works fine in one context has the potential to cause warnings in another. Some example warnings can be seen in the log below:

Listing 2. Logged warnings


!ENTRY org.eclipse.equinox.ds 1 0 2016-11-04 15:55:27.344
!MESSAGE Could not bind a reference of component Data Slice Service Holder. The reference is: Reference[name = IImageService, interface = org.eclipse.dawnsci.plotting.api.histogram.IImageService, policy = dynamic, cardinality = 0..1, target = null, bind = setImageService, unbind = null]

!ENTRY org.eclipse.equinox.ds 1 0 2016-11-04 15:55:27.345
!MESSAGE Could not bind a reference of component Data Slice Service Holder. The reference is: Reference[name = IPlotImageService, interface = org.eclipse.dawnsci.plotting.api.image.IPlotImageService, policy = dynamic, cardinality = 0..1, target = null, bind = setPlotImageService, unbind = null]
Starting VMXi SampleHandling Service

!ENTRY org.eclipse.scanning.connector.epicsv3 4 0 2016-11-04 15:55:27.385
!MESSAGE [SCR] Error occurred while processing end tag of XML 'bundleentry://627.fwk1711105800/OSGI-INF/epicsv3DynamicDataset.xml' in bundle org.eclipse.scanning.connector.epicsv3_1.0.0.qualifier [627]!  The 'service' tag must have one 'provide' tag set at line 4 

!ENTRY org.eclipse.equinox.ds 1 0 2016-11-08 10:25:48.544
!MESSAGE Could not bind a reference of component Scanning Servlet Services. The reference is: Reference[name = IEventService, interface = org.eclipse.scanning.api.event.IEventService, policy = dynamic, cardinality = 0..1, target = null, bind = setEventService, unbind = null]

The way around this problem is to understand each message concerned. You can do things like adding -Dequinox.ds.print=true, which will print the actual errors so that they can be resolved. This is helpful. However, some messages are correct but don’t actually matter. The first one above, for instance, is a 0..1 cardinality reference to a service that is not yet available; later in the class loading, this service will resolve and work when used. So my advice would be: understand all your OSGi error messages, and be warned that some really matter while others will self-correct at runtime.

Real world problem #7: The hidden cost of TDD

A significant fraction of our code base is older than our decision to use test-driven development. Then there’s the fact that we sometimes have to add certain features in a hurry for a given experiment. These can, not surprisingly, be areas with limited tests. So everything new we wrote in the OSGi server, we wanted to do with a TDD methodology, bolting things down as we went. All the modularity provided by moving to services allowed the new services to be mocked out and a huge number of tests to be created, which led to an interesting problem: all the new tests added significantly to our Jenkins build time, impacting developers in the whole group.

The tests were checking bundles that the rest of the group were unlikely to be changing on the beamline, for instance generic scanning or file writing. So we brought in Travis CI. Travis runs from a GitHub webhook, executing the build (Maven) and tests (JUnit) for us whenever a pull request is submitted to one of the repositories. This means that in the main product we can have a faster build (in-house, on Jenkins), because specific API bundles in separate GitHub repositories have their own separate build and test.
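The Travis side needs very little configuration; a minimal .travis.yml for such a repository might look like this (a sketch, not our exact file):

    language: java
    jdk: oraclejdk8
    script: mvn clean verify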

Increased modularity helped break up the build, which reduced the time developers spent waiting for a build and test run, for instance when doing Gerrit reviews, and allowed an increased rate of development.

Have a go! Get the OSGi scanning server code

Diamond Light Source are committed to open source data and open source code. With the help of the Eclipse Foundation, we are planning to get most of the parts of our data acquisition system IP-checked. Follow these instructions to get a toy OSGi scanning server and run it with a user interface and mocked-out hardware connection. (Note that you should be familiar with targets and products in Eclipse, and with Git.)

  1. Get the code from GitHub:

    git clone --depth=50 --branch=master https://github.com/DiamondLightSource/daq-eclipse ./eclipse/org.eclipse.scanning
    git clone --depth=50 --branch=master https://github.com/eclipse/richbeans.git ./eclipse/org.eclipse.richbeans
    git clone --depth=50 --branch=master https://github.com/eclipse/dawnsci.git ./eclipse/org.eclipse.dawnsci
    git clone --depth=50 --branch=master https://github.com/DawnScience/dawn-hdf.git ./dawn-hdf

  2. Import all the projects from the repositories you checked out into your Eclipse workspace. You will need Eclipse with the RCP development tools.
  3. Open the file org.eclipse.scanning.target.platform.fat.target. You need Eclipse to download these components to your target, which will happen when you open the file. Click the Set as Target Platform link in the top-right corner:

    Figure 4. Set the target platform (click to enlarge). Image credit: Matthew Gerring.

  4. At this point all the projects should compile. You should start the server using the product org.eclipse.scanning.example.server.product and then start the client using the product org.eclipse.scanning.example.client.fat.product. If the server starts correctly you will see the message:
    
    11:36:15.434 INFO  o.e.scanning.event.ConsumerImpl - X-Ray Centering Consumer Submission ActiveMQ connection to failover:(tcp://localhost:61616)?startupMaxReconnectAttempts=3 made.
    [Consumer Thread X-Ray Centering Consumer]
    
    
    It starts up a local version of ActiveMQ on port 61616. You can configure ActiveMQ using command-line options.
  5. Try running a scan by going to the Scanning perspective and drawing a grid scan using the Scan Editor. It looks something like this:

    Figure 5. Using the Scan Editor (click to enlarge). Image credit: Matthew Gerring.

Conclusions

Moving mature, complex, and mission-critical software products to dynamic class loading is actually fairly easy, and we certainly should have done it sooner! We had some details to get over, but we were able to find solutions without spending a huge amount of time. The move suited the way we work and improved it. We found mature tools to pick up and use, a clear migration path, and blogs to follow and help with the process. Moving to OSGi is so straightforward that I would recommend anyone considering a dynamic class loading solution invest some time in it.

Thanks to staff at Diamond Light Source for working on our OSGi upgrade and editing this article for JavaWorld.
