I have code which essentially looks like this:
```scala
class FoodTrainer(images: S3Path) { // data is a >100 GB file living in S3
  def train(): FoodClassifier // very expensive - takes ~5 hours!
}

class FoodClassifier { // light-weight API class
  def isHotDog(input: Image): Boolean
}
```
At JAR-assembly time (`sbt assembly`), I want to invoke `val classifier = new FoodTrainer(s3Dir).train()` and publish a JAR in which the `classifier` instance is instantly available to downstream library users.

What is the easiest way to do this? What are some established paradigms for it? I know it's a fairly common idiom in ML projects to publish trained models, e.g. http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar

How do I do this using `sbt assembly` without having to check a large model class or data file into my version control?
Okay, I managed to do this:

Separate the project into two SBT sub-modules: `food-trainer` and `food-model`. The former is only invoked at compile time to create the model and serialize it into the generated resources of the latter. The latter serves as a simple factory object that instantiates the model from its serialized form. Every downstream project depends only on this `food-model` sub-module.

The `food-trainer` module has the bulk of the code and a main method that serializes the `FoodModel` (note that `ObjectOutputStream` wraps an `OutputStream`, not a `File`):

```scala
import java.io.{FileOutputStream, ObjectOutputStream}

object FoodTrainer {
  def main(args: Array[String]): Unit = {
    val input = args(0)
    val outputDir = args(1)
    val model: FoodModel = new FoodTrainer(input).train()
    val out = new ObjectOutputStream(new FileOutputStream(outputDir + "/model.bin"))
    out.writeObject(model)
    out.close()
  }
}
```
Add a compile-time task in your `build.sbt` that runs the food-trainer module:

```scala
lazy val foodTrainer = (project in file("food-trainer"))

lazy val foodModel = (project in file("food-model"))
  .dependsOn(foodTrainer)
  .settings(
    resourceGenerators in Compile += Def.task {
      val log = streams.value.log
      val dest = (resourceManaged in Compile).value
      IO.createDirectory(dest)
      runModuleMain(
        cmd = s"com.foo.bar.FoodTrainer $pathToImages ${dest.getAbsolutePath}",
        cp = (fullClasspath in Runtime in foodTrainer).value.files,
        log = log
      )
      Seq(dest / "model.bin")
    }.taskValue
  )

def runModuleMain(cmd: String, cp: Seq[File], log: Logger): Unit = {
  log.info(s"Running $cmd")
  val opt = ForkOptions(bootJars = cp, outputStrategy = Some(LoggedOutput(log)))
  val res = Fork.scala(config = opt, arguments = cmd.split(' '))
  require(res == 0, s"$cmd exited with code $res")
}
```
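The essential wiring above is just an sbt resource generator: any files the task returns under `resourceManaged` are packaged into the JAR. A minimal standalone sketch (sbt 0.13 syntax; the string write is a placeholder standing in for the forked trainer run):

```scala
// build.sbt fragment (sketch) - the returned files end up in the JAR's resources
resourceGenerators in Compile += Def.task {
  val dest = (resourceManaged in Compile).value / "model.bin"
  IO.write(dest, "placeholder") // the real build forks FoodTrainer to write this file
  Seq(dest)
}.taskValue
```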
Now in your `food-model` module, you have something like this (the original had a misplaced parenthesis - `readObject()` must be called on the `ObjectInputStream`, not the resource stream):

```scala
object FoodModel {
  lazy val model: FoodModel =
    new ObjectInputStream(getClass.getResourceAsStream("/model.bin"))
      .readObject()
      .asInstanceOf[FoodModel]
}
```
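The two code paths above form a Java-serialization round trip. A self-contained sketch of that round trip, using a hypothetical `DummyModel` stand-in (Scala case classes are `Serializable` by default; the real `FoodModel` must be too):

```scala
import java.io.{File, FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for FoodModel; anything passed to writeObject
// must be Serializable (case classes are, out of the box).
case class DummyModel(threshold: Double)

object ModelIO {
  // Write side: what FoodTrainer.main does at build time.
  def save(model: DummyModel, file: File): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(file))
    try out.writeObject(model) finally out.close()
  }

  // Read side: what FoodModel does at runtime, except it reads from
  // getClass.getResourceAsStream("/model.bin") inside the packaged JAR.
  def load(file: File): DummyModel = {
    val in = new ObjectInputStream(new FileInputStream(file))
    try in.readObject().asInstanceOf[DummyModel] finally in.close()
  }
}
```

One caveat of this approach: Java serialization ties the bytes in `model.bin` to the class's `serialVersionUID`, so retrain and repackage whenever the model class changes shape.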
Every downstream project now depends only on `food-model` and simply uses `FoodModel.model`. We get the benefit of:

- The model is loaded statically and quickly at runtime from the JAR's packaged resources
- No need to train the model at runtime (which is very expensive)
- No need to check the model into version control (again, the binary model is very big) - it is only packaged into your JAR
- No need to separate the `FoodTrainer` and `FoodModel` packages into their own JARs (which would mean the headache of deploying them internally) - instead we keep them as sub-modules of the same project, which gets packed into a single JAR