On Spark's built-in image data source

Overview

    Apache Spark 2.4 introduced a new built-in data source: the image data source. With it, users can load the image files in a specified directory through the DataFrame API to obtain a DataFrame object, perform simple processing on the image data via the DataFrame API, and then hand the data to MLlib for training and classification computations.
    This article describes the implementation details and usage of the image data source.

Basic usage

    Let's start with a simple example to see how the image data source is used. In this example, a set of image files is stored on Alibaba Cloud OSS; each image needs to be watermarked, and the results are stored compressed as Parquet files. Without further ado, here is the code:

  // To keep the focus clear, the image-format handling logic is simplified.
  // Imports assumed by this sample:
  import java.awt.image.{BufferedImage, WritableRaster}
  import java.awt.{Color, Font}
  import java.io.ByteArrayOutputStream
  import javax.imageio.ImageIO

  import org.apache.spark.SparkConf
  import org.apache.spark.ml.image.ImageSchema
  import org.apache.spark.sql.catalyst.encoders.RowEncoder
  import org.apache.spark.sql.{Row, SparkSession}

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]")
    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()

    val imageDF = spark.read.format("image").load("oss://<bucket>/path/to/src/dir")
    imageDF.select("image.origin", "image.width", "image.height", "image.nChannels", "image.mode", "image.data")
        .map(row => {
          val origin = row.getAs[String]("origin")
          val width = row.getAs[Int]("width")
          val height = row.getAs[Int]("height")
          val mode = row.getAs[Int]("mode")
          val nChannels = row.getAs[Int]("nChannels")
          val data = row.getAs[Array[Byte]]("data")
          Row(Row(origin, height, width, nChannels, mode,
            markWithText(width, height, BufferedImage.TYPE_3BYTE_BGR, data, "EMR")))
          // Dataset.map over Rows needs an explicit encoder; reuse the image schema
        })(RowEncoder(ImageSchema.imageSchema))
        .write.format("parquet").save("oss://<bucket>/path/to/dst/dir")
  }

  def markWithText(width: Int, height: Int, imageType: Int, data: Array[Byte], text: String): Array[Byte] = {
    // Rebuild a BufferedImage from the decoded pixel matrix.
    // Note: data is in BGR order and no channel conversion is done here;
    // see the discussion of the resulting color shift below.
    val image = new BufferedImage(width, height, imageType)
    val raster = image.getData.asInstanceOf[WritableRaster]
    val pixels = data.map(_.toInt)
    raster.setPixels(0, 0, width, height, pixels)
    image.setData(raster)
    // Draw the watermark text onto a copy of the image
    val buffImg = new BufferedImage(width, height, imageType)
    val g = buffImg.createGraphics
    g.drawImage(image, 0, 0, null)
    g.setColor(Color.red)
    g.setFont(new Font("宋体", Font.BOLD, 30))
    g.drawString(text, width / 2, height / 2)
    g.dispose()
    // Re-encode the watermarked image as JPG bytes
    val buffer = new ByteArrayOutputStream
    ImageIO.write(buffImg, "JPG", buffer)
    buffer.toByteArray
  }

Extracting the binary image data of one record from the generated Parquet file and saving it as a local JPG gives the following result:

Figure 1: the original image (left) and the processed image (right)


You may have noticed that the colors of the two images differ. This is because Spark decodes image data into BGR channel order, and the sample program does not convert the order back when saving, which produces the color shift.
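
If you want the saved copy to keep its original colors, one option is to swap the channel order before building the output image. The helper below is a minimal sketch (a hypothetical function, not part of the original sample), assuming the 3-channel BGR layout produced by the data source:

  // Hypothetical helper: convert the decoded BGR(A)-ordered bytes to RGB
  // order. Raster.setPixels expects samples in band order (R, G, B here),
  // so swapping the blue and red samples reproduces the original colors.
  def bgrToRgb(data: Array[Byte], nChannels: Int): Array[Byte] = {
    val rgb = data.clone()
    var i = 0
    while (i < rgb.length) {
      rgb(i) = data(i + 2)   // R sample
      rgb(i + 2) = data(i)   // B sample
      i += nChannels
    }
    rgb
  }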

A first look at the implementation

Now let's dig into the Spark source code for the implementation details. Apache Spark's built-in image data source consists of two classes, both in the spark-mllib module:

  • org.apache.spark.ml.image.ImageSchema
  • org.apache.spark.ml.source.image.ImageFileFormat

ImageSchema defines the Row format of an image file loaded into a DataFrame, along with the decoding methods; ImageFileFormat provides the read/write interface to the storage layer.
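
As a side note (a sketch, not from the original text): besides spark.read.format("image"), ImageSchema also exposes an older readImages entry point that produces the same schema:

  import org.apache.spark.ml.image.ImageSchema

  // Two equivalent ways to load images into a DataFrame with the image schema
  val df1 = spark.read.format("image").load("oss://<bucket>/path/to/src/dir")
  val df2 = ImageSchema.readImages("oss://<bucket>/path/to/src/dir")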

Format definition

After an image file is loaded into a DataFrame, each row corresponds to the following schema:

  val columnSchema = StructType(
    StructField("origin", StringType, true) ::
    StructField("height", IntegerType, false) ::
    StructField("width", IntegerType, false) ::
    StructField("nChannels", IntegerType, false) ::
    // OpenCV-compatible type: CV_8UC3 in most cases
    StructField("mode", IntegerType, false) ::
    // Bytes in OpenCV-compatible order: row-wise BGR in most cases
    StructField("data", BinaryType, false) :: Nil)

  val imageFields: Array[String] = columnSchema.fieldNames
  val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil)
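
Based on this definition, printing the schema of a loaded DataFrame should produce roughly the following (a sketch for illustration; imageDF is the DataFrame from the earlier example):

  imageDF.printSchema()
  // root
  //  |-- image: struct (nullable = true)
  //  |    |-- origin: string (nullable = true)
  //  |    |-- height: integer (nullable = false)
  //  |    |-- width: integer (nullable = false)
  //  |    |-- nChannels: integer (nullable = false)
  //  |    |-- mode: integer (nullable = false)
  //  |    |-- data: binary (nullable = false)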

Selecting the image fields and showing the DataFrame prints a table of the following form:

+--------------------+-----------+------------+---------------+----------+-------------------+
|image.origin        |image.width|image.height|image.nChannels|image.mode|image.data         |
+--------------------+-----------+------------+---------------+----------+-------------------+
|oss://.../dir/1.jpg |600        |343         |3              |16        |55 45 21 56  ...   |
+--------------------+-----------+------------+---------------+----------+-------------------+

Where:

  • origin: path of the original image file
  • width: width of the image in pixels
  • height: height of the image in pixels
  • nChannels: number of channels in the image; for a common RGB bitmap this is 3
  • mode: the element type of the pixel matrix (data) and the channel order, using OpenCV-compatible type values
  • data: the decoded pixel matrix

Tip: for details of the supported image formats, refer to the following documentation: Image File Reading and Writing
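
The mode value of 16 shown in the table above can be mapped back to its OpenCV type name through the public ImageSchema.ocvTypes map; a small sketch:

  import org.apache.spark.ml.image.ImageSchema

  // Look up the OpenCV type name for a given mode value, e.g. 16 -> CV_8UC3
  val modeName: Option[String] = ImageSchema.ocvTypes.collectFirst {
    case (name, value) if value == 16 => name
  }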

Loading and decoding

The image file is loaded into a Row object by ImageFileFormat.

// File: ImageFileFormat.scala
// The code is trimmed and modified here for clarity
private[image] class ImageFileFormat extends FileFormat with DataSourceRegister {
  ......

  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory = {
    throw new UnsupportedOperationException("Write is not supported for image data source")
  }

  override protected def buildReader(
      sparkSession: SparkSession,
      dataSchema: StructType,
      partitionSchema: StructType,
      requiredSchema: StructType,
      filters: Seq[Filter],
      options: Map[String, String],
      hadoopConf: Configuration): (PartitionedFile) => Iterator[InternalRow] = {
    ......
    (file: PartitionedFile) => {
      ......
      val path = new Path(origin)
      val stream = fs.open(path)
      val bytes = ByteStreams.toByteArray(stream)
      val resultOpt = ImageSchema.decode(origin, bytes) // <-- decoding happens here
      val filteredResult = Iterator(resultOpt.getOrElse(ImageSchema.invalidImageRow(origin)))
      ......
      val converter = RowEncoder(requiredSchema)
      filteredResult.map(row => converter.toRow(row))
      ......
    }
  }
}

From this we can see that:

  • The current implementation does not support saving image data;
  • The actual decoding work is done in ImageSchema (note also the invalidImageRow fallback for undecodable files; see the sketch below).
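
Files that fail to decode are not silently skipped by buildReader; they fall back to ImageSchema.invalidImageRow unless filtered out. The image data source's "dropInvalid" option drops such rows at load time; a usage sketch:

  // Drop files that cannot be decoded instead of keeping
  // the placeholder rows produced by invalidImageRow
  val df = spark.read.format("image")
    .option("dropInvalid", true)
    .load("oss://<bucket>/path/to/src/dir")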

Let's look at the specific decoding process:

// File: ImageSchema.scala
// The code is trimmed and modified here for clarity
private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = {
  // Load the raw image data with ImageIO
  val img = ImageIO.read(new ByteArrayInputStream(bytes))
  if (img != null) {
    // Read the basic attributes of the image
    val isGray = img.getColorModel.getColorSpace.getType == ColorSpace.TYPE_GRAY
    val hasAlpha = img.getColorModel.hasAlpha
    val height = img.getHeight
    val width = img.getWidth
    // ImageIO::ImageType -> OpenCV type
    val (nChannels, mode) = if (isGray) {
      (1, ocvTypes("CV_8UC1"))
    } else if (hasAlpha) {
      (4, ocvTypes("CV_8UC4"))
    } else {
      (3, ocvTypes("CV_8UC3"))
    }
    // Decode
    val imageSize = height * width * nChannels
    // Buffer for the decoded pixel matrix
    val decoded = Array.ofDim[Byte](imageSize)
    if (isGray) {
      // Handle single-channel images
      ...
    } else {
      // Handle multi-channel images
      var offset = 0
      for (h <- 0 until height) {
        for (w <- 0 until width) {
          val color = new Color(img.getRGB(w, h), hasAlpha)
          // The decoded channel order is BGR(A)
          decoded(offset) = color.getBlue.toByte
          decoded(offset + 1) = color.getGreen.toByte
          decoded(offset + 2) = color.getRed.toByte
          if (hasAlpha) {
            decoded(offset + 3) = color.getAlpha.toByte
          }
          offset += nChannels
        }
      }
    }
    // Wrap everything into a single Row
    Some(Row(Row(origin, height, width, nChannels, mode, decoded)))
  } else {
    None
  }
}

From this we can see that:

  • The implementation uses the javax.imageio (ImageIO) library to decode image files. ImageIO is a capable, professional Java image-processing library supporting many formats, but compared with more specialized CV libraries (such as OpenCV) there is still a large gap in both functionality and performance;
  • The channel order and pixel value type of the decoded data are fixed: the channel order is always BGR(A), and the pixel value type is always 8U;
  • At most four channels are supported; images such as multispectral images, which can contain dozens of bands, cannot be handled;
  • The decoded output contains only basic fields such as width, height, number of channels, and mode; if you need more detailed metadata, such as EXIF data or GPS coordinates, you are on your own;
  • The data source performs the decoding while generating the DataFrame, and the decoded image data is held in Java heap memory. For real projects this is likely too heavyweight an implementation, consuming a lot of resources, both memory and bandwidth if a shuffle happens (for a sense of scale, compare the file sizes of the same image saved as BMP and as JPG; see the rough calculation below).
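
As a rough back-of-the-envelope illustration of that last point (the numbers are illustrative, not from the original article):

  // Decoded in-heap size of a single 1920x1080 3-channel image:
  val decodedBytes = 1920 * 1080 * 3 // = 6,220,800 bytes, roughly 5.9 MB
  // The same picture stored as a JPG is often only a few hundred KB, so
  // shuffling decoded pixel data can cost several times more bandwidth.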

Encoding and storage

As seen in the analysis above, the current image data source does not support encoding a processed pixel matrix and saving it in a specified image file format.
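
Until such support lands, re-encoding has to be done by hand, as the watermark example above did. A compact sketch (a hypothetical helper, assuming the 3-channel BGR layout produced by the data source):

  import java.awt.image.{BufferedImage, DataBufferByte}
  import java.io.File
  import javax.imageio.ImageIO

  // Hypothetical helper: encode a decoded BGR pixel matrix as a JPG file.
  // TYPE_3BYTE_BGR stores samples as B, G, R per pixel, matching the
  // data source's decoded layout, so the bytes can be copied over directly.
  def saveAsJpg(data: Array[Byte], width: Int, height: Int, dest: File): Unit = {
    val image = new BufferedImage(width, height, BufferedImage.TYPE_3BYTE_BGR)
    val target = image.getRaster.getDataBuffer.asInstanceOf[DataBufferByte].getData
    System.arraycopy(data, 0, target, 0, data.length)
    ImageIO.write(image, "jpg", dest)
  }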

Image processing capabilities

The current version of Apache Spark provides no UDFs for image data; image processing has to be implemented with the ImageIO library or with a more specialized CV library.
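
For example, a hand-rolled UDF built on plain JVM code might look like this (a sketch, assuming 3-channel BGR input; toGray and the output column name are illustrative, not a Spark API):

  import org.apache.spark.sql.functions.{col, udf}

  // Average the B, G, R samples of each pixel into a single gray channel
  val toGray = udf { (data: Array[Byte], width: Int, height: Int, nChannels: Int) =>
    val gray = Array.ofDim[Byte](width * height)
    var i = 0
    while (i < gray.length) {
      val base = i * nChannels
      val b = data(base) & 0xFF
      val g = data(base + 1) & 0xFF
      val r = data(base + 2) & 0xFF
      gray(i) = ((b + g + r) / 3).toByte
      i += 1
    }
    gray
  }

  val grayDF = imageDF.select(
    toGray(col("image.data"), col("image.width"), col("image.height"), col("image.nChannels"))
      .as("gray"))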

Summary

Apache Spark's built-in image data source makes it fairly convenient to load image files for analysis. However, the current implementation is quite simple, and its performance and resource consumption should not be viewed too optimistically. Moreover, the current version only provides the ability to load image data; it offers no packaged implementations of common processing algorithms, so it cannot properly support more specialized analysis in vertical CV domains. Of course, this matches how the image data source is positioned in Spark: image data mainly serves as input for DL model training, and such workloads rarely require much image processing themselves. If you want to use the Spark framework to accomplish more realistic image processing tasks, a lot of work remains, such as:

  • Support for a richer metadata model
  • More specialized and flexible codec libraries, with control over the codec process
  • Packaged CV operators and functions
  • More efficient memory management
  • GPU support

Due to limited space, these topics will not be explored here.
One more thought: now that Spark supports processing image data (limited as that support is), can support for video stream data be far behind?

Source: yq.aliyun.com/articles/705464