SBT: How to package an instance of a class as a JAR?

pathikrit :

I have code which essentially looks like this:

class FoodTrainer(images: S3Path) { // data is >100GB file living in S3
  def train(): FoodClassifier       // Very expensive - takes ~5 hours!
}

class FoodClassifier {          // Light-weight API class
  def isHotDog(input: Image): Boolean
}

At JAR-assembly (sbt assembly) time, I want to invoke val classifier = new FoodTrainer(s3Dir).train() and publish a JAR in which the classifier instance is instantly available to downstream library users.

What is the easiest way to do this? What are some established paradigms for this? I know it's a fairly common idiom in ML projects to publish trained models, e.g. http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar

How do I do this using sbt assembly without having to check a large model class or data file into my version control?

pathikrit :

Okay, I managed to do this:

  1. Split the project into two SBT sub-modules: food-trainer and food-model. The former is invoked only at compile time to create the model and serialize it into the generated resources of the latter; the latter is a simple factory that instantiates the model from its serialized form. Every downstream project depends only on the food-model sub-module (see the layout sketch below).
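
Roughly, the resulting project layout looks like this (paths are illustrative; the package com.foo.bar is taken from the build snippet further down):

    food-project/
    ├── build.sbt          // defines both sub-modules and the resource-generator task
    ├── food-trainer/      // heavy training code, run only at build time
    │   └── src/main/scala/com/foo/bar/FoodTrainer.scala
    └── food-model/        // light-weight API, ships with the serialized model.bin resource
        └── src/main/scala/com/foo/bar/FoodModel.scala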

  2. The food-trainer module has the bulk of the code and a main method that serializes the FoodModel:

    import java.io.{File, FileOutputStream, ObjectOutputStream}

    object FoodTrainer {
      def main(args: Array[String]): Unit = {
        val input = args(0)      // e.g. the S3 path to the training images
        val outputDir = args(1)  // the managed-resource directory of food-model
        val model: FoodModel = new FoodTrainer(input).train()
        // Serialize the trained model into food-model's generated resources
        val out = new ObjectOutputStream(new FileOutputStream(new File(outputDir, "model.bin")))
        try out.writeObject(model) finally out.close()
      }
    }
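
For this to work, the FoodModel class (and everything it references) has to be serializable with plain Java serialization. A minimal sketch, assuming Image is whatever input type your classifier takes:

    // Sketch: the trained model must extend Serializable so that ObjectOutputStream /
    // ObjectInputStream can write and read it; keep only cheap inference logic here.
    @SerialVersionUID(1L)
    class FoodModel extends Serializable {
      def isHotDog(input: Image): Boolean = ??? // uses the trained parameters stored in fields
    }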
    
  3. Add a compile-time task to your build.sbt that runs the food-trainer module and generates the model resource:

    lazy val foodTrainer = (project in file("food-trainer"))

    lazy val foodModel = (project in file("food-model"))
      .dependsOn(foodTrainer)
      .settings(
        resourceGenerators in Compile += Def.task {
          val log = streams.value.log
          val dest = (resourceManaged in Compile).value
          IO.createDirectory(dest)
          // Fork a JVM that runs FoodTrainer.main and writes model.bin into food-model's managed resources
          runModuleMain(
            cmd = s"com.foo.bar.FoodTrainer $pathToImages ${dest.getAbsolutePath}",
            cp = (fullClasspath in Runtime in foodTrainer).value.files,
            log = log
          )
          Seq(dest / "model.bin")
        }.taskValue
      )
    
    // Forks a separate JVM with the given classpath and runs the given main class via the Scala runner
    def runModuleMain(cmd: String, cp: Seq[File], log: Logger): Unit = {
      log.info(s"Running $cmd")
      val opt = ForkOptions(bootJars = cp, outputStrategy = Some(LoggedOutput(log)))
      val res = Fork.scala(config = opt, arguments = cmd.split(' '))
      require(res == 0, s"$cmd exited with code $res")
    }
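
Note that pathToImages is not defined in the snippet above; it is assumed to come from somewhere else in your build, e.g. hard-coded or read from the environment at build time:

    // Assumption: where the training images live; adjust to your setup
    lazy val pathToImages: String = sys.env.getOrElse("FOOD_IMAGES_PATH", "s3://my-bucket/food-images")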
    
  4. Now in your food-model module, you have something like this:

    import java.io.ObjectInputStream

    object FoodModel {
      // Deserialize the model once, lazily, from the JAR's packaged resources
      lazy val model: FoodModel =
        new ObjectInputStream(getClass.getResourceAsStream("/model.bin"))
          .readObject().asInstanceOf[FoodModel]
    }
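
A downstream user then does something like this (someImage stands in for whatever Image instance you have):

    import com.foo.bar.FoodModel

    // The first access triggers the one-time lazy deserialization of /model.bin from the JAR
    val verdict: Boolean = FoodModel.model.isHotDog(someImage)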
    

Every downstream project now depends only on food-model and simply uses FoodModel.model. We get the following benefits:

  1. The model is loaded quickly at runtime from the JAR's packaged resources
  2. No need to train the model at runtime (very expensive)
  3. No need to check the model into version control (again, the binary model is very big) - it is only packaged into the JAR
  4. No need to split FoodTrainer and FoodModel into separately published JARs (which would add the headache of deploying them internally) - instead we keep them in the same project as different sub-modules, which get packed into a single JAR.
