Splitting a string with byte length limits in Java

KYHSGeekCode :

I want to split a String into a String[] array whose elements meet the following conditions.

  • For each element s, s.getBytes(encoding).length should not exceed maxsize (an int).

  • If I join the resulting strings with a StringBuilder or the + operator, the result should be exactly the original string.

  • The input string may contain Unicode characters that take multiple bytes when encoded in, e.g., UTF-8.
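To make the last condition concrete, here is a minimal sketch of how the UTF-8 byte count differs from the char count. The class name Utf8LengthDemo and the helper utf8Length are illustrative only and not part of the question.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of bytes the given text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a', e-acute, euro sign, and an emoji stored as a surrogate pair
        String[] samples = {"a", "\u00e9", "\u20ac", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(s.length() + " char(s), " + utf8Length(s) + " byte(s)");
        }
        // prints:
        // 1 char(s), 1 byte(s)
        // 1 char(s), 2 byte(s)
        // 1 char(s), 3 byte(s)
        // 2 char(s), 4 byte(s)
    }
}
```

So a split position has to be chosen by encoded size, and it must not fall between the two chars of a surrogate pair.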

The desired prototype is shown below.

public static String[] SplitStringByByteLength(String src,String encoding, int maxsize)

And the testing code:

public boolean isNice(String str, String encoding, int max) throws UnsupportedEncodingException
{
    //boolean success=true;
    StringBuilder b=new StringBuilder();
    String[] splitted= SplitStringByByteLength(str,encoding,max);
    for(String s: splitted)
    {
        if(s.getBytes(encoding).length>max)
            return false;
        b.append(s);
    }
    if(str.compareTo(b.toString())!=0)
        return false;
    return true;
}

Though it seems easy when the input string has only ASCII characters, the fact that it could contain multibyte characters confuses me.

Thank you in advance.

Edit: I added my own implementation. (Inefficient)

public static String[] SplitStringByByteLength(String src,String encoding, int maxsize) throws UnsupportedEncodingException
{
    ArrayList<String> splitted=new ArrayList<String>();
    StringBuilder builder=new StringBuilder();
    // Java strings are not '\0'-terminated, so loop up to length()
    for(int i=0;i<src.length();++i)
    {
        String tmp=builder.toString();      // chunk built so far, known to fit
        char c=src.charAt(i);
        builder.append(c);
        if(builder.toString().getBytes(encoding).length>maxsize)
        {
            splitted.add(tmp);
            builder=new StringBuilder();
            builder.append(c);              // c did not fit, so it starts the next chunk
        }
    }
    if(builder.length()>0)
        splitted.add(builder.toString());   // do not lose the final chunk
    return splitted.toArray(new String[splitted.size()]);
}

Is this the only way to solve this problem?

Serge Ballesta :

The class CharsetEncoder has provision for your requirement. Extract from the Javadoc of the encode method:

public final CoderResult encode(CharBuffer in,
                            ByteBuffer out,
                            boolean endOfInput)

Encodes as many characters as possible from the given input buffer, writing the results to the given output buffer...

In addition to reading characters from the input buffer and writing bytes to the output buffer, this method returns a CoderResult object to describe its reason for termination:

...

CoderResult.OVERFLOW indicates that there is insufficient space in the output buffer to encode any more characters. This method should be invoked again with an output buffer that has more remaining bytes. This is typically done by draining any encoded bytes from the output buffer.
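As a small illustration of that behaviour, the sketch below encodes a three-character string into output buffers of different capacities and reports how many chars the encoder consumed before stopping. The class name OverflowDemo and the method charsConsumed are made up for this example. It assumes UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class OverflowDemo {
    // Encodes text into a buffer of the given capacity and returns the number
    // of chars the encoder consumed before it stopped.
    static int charsConsumed(String text, int capacity) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(capacity);
        encoder.encode(in, out, true);   // stops with OVERFLOW when out is full
        return in.position();
    }

    public static void main(String[] args) {
        String text = "a\u20acb";        // 'a' = 1 byte, euro sign = 3 bytes, 'b' = 1 byte
        System.out.println(charsConsumed(text, 3)); // 1: the euro sign does not fit after 'a'
        System.out.println(charsConsumed(text, 4)); // 2: 'a' and the euro sign fill the buffer
        System.out.println(charsConsumed(text, 5)); // 3: everything fits
    }
}
```

The encoder never writes part of a character, so the input position after an overflow is always a safe place to cut the string.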

A possible implementation could be:

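import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.util.ArrayList;
import java.util.List;

public static String[] SplitStringByByteLength(String src,String encoding, int maxsize) {
    Charset cs = Charset.forName(encoding);
    CharsetEncoder coder = cs.newEncoder();
    ByteBuffer out = ByteBuffer.allocate(maxsize);  // output buffer of required size
    CharBuffer in = CharBuffer.wrap(src);
    List<String> ss = new ArrayList<>();            // a list to store the chunks
    int pos = 0;
    while(true) {
        CoderResult cr = coder.encode(in, out, true); // try to encode as much as possible
        int newpos = src.length() - in.length();    // in.length() is the number of chars not yet consumed
        if (cr.isOverflow() && newpos == pos) {     // no progress: a single character needs more than maxsize bytes
            throw new IllegalArgumentException("maxsize is too small for the character at index " + pos);
        }
        String s = src.substring(pos, newpos);
        ss.add(s);                                  // add what has been encoded to the list
        pos = newpos;                               // store new input position
        out.clear();                                // and reset the output buffer for the next chunk
        if (! cr.isOverflow()) {
            break;                                  // everything has been encoded
        }
    }
    return ss.toArray(new String[0]);
}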

This will split the original string into chunks that are as long as possible while their encoded form still fits in a byte array of the given size (assuming, of course, that maxsize is not ridiculously small: it must be able to hold at least the largest single encoded character).
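To check the result against the two conditions from the question, something along these lines could be used. It is a self-contained sketch, not a definitive implementation: the class name SplitCheck and the method name split are hypothetical, split repeats the encoder-based approach above, and it includes a guard that throws when maxsize cannot hold a single character.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.util.ArrayList;
import java.util.List;

class SplitCheck {
    // Encoder-based split: each chunk is what fits into maxsize bytes.
    static String[] split(String src, String encoding, int maxsize) {
        CharsetEncoder coder = Charset.forName(encoding).newEncoder();
        ByteBuffer out = ByteBuffer.allocate(maxsize);
        CharBuffer in = CharBuffer.wrap(src);
        List<String> chunks = new ArrayList<>();
        int pos = 0;
        while (true) {
            CoderResult cr = coder.encode(in, out, true);
            int newpos = src.length() - in.remaining();
            if (cr.isOverflow() && newpos == pos) {
                // Guard (an assumption of this sketch): avoid looping forever
                throw new IllegalArgumentException(
                        "maxsize is too small for the character at index " + pos);
            }
            chunks.add(src.substring(pos, newpos));
            pos = newpos;
            out.clear();
            if (!cr.isOverflow()) {
                break;
            }
        }
        return chunks.toArray(new String[0]);
    }

    // The question's check: every chunk fits, and the chunks join to the input.
    static boolean isNice(String str, String encoding, int max) {
        Charset cs = Charset.forName(encoding);
        StringBuilder joined = new StringBuilder();
        for (String s : split(str, encoding, max)) {
            if (s.getBytes(cs).length > max) {
                return false;
            }
            joined.append(s);
        }
        return str.equals(joined.toString());
    }

    public static void main(String[] args) {
        // Mixed 1-, 2-, 3- and 4-byte characters in UTF-8
        String text = "h\u00e9llo \u20ac w\u00f6rld \uD83D\uDE00!";
        System.out.println(isNice(text, "UTF-8", 4)); // true
        for (String chunk : split("a\u20acb", "UTF-8", 4)) {
            System.out.println(chunk.length());       // 2, then 1
        }
    }
}
```

With maxsize 4 the emoji (4 bytes in UTF-8) still fits in a chunk of its own; with maxsize 3 the guard throws instead of looping.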
