How to separate many different words from a string (Java)

Erandall :

I've been struggling to figure out how to get a word, of unknown length, from a string, of unknown length, that I'm reading from a file. The words I want from the string are always separated by "." and/or "&" with the whole string being surrounded by quotes. EX: ".Word.Characters&Numeric&Letters.Typos&Mistypes." I know the location of each "." and "&" as well as how many times they occur.

I want to feed the words into an array Example[i][j] based on whether or not the words are separated by a "." or a "&". So words contained between "." would be set into the i column of the array and words linked by "&" into the j rows of the array.

The input string can contain a widely varying number of words: there may be only one word of interest, or more than a hundred.

I'd prefer to use arrays to solve this problem. From what I've read, regex would be slow but would work. split() may also work, but I think I'd have to know which words to look for beforehand.

From this String: ".Word.Characters&Numeric&Letters.Typos&Mistypes." I'd expect to get: (without worrying about which is a row or column)

[[Word],[null],[null]],

[[Characters],[Numeric],[Letters]],

[[Typos],[Mistypes],[null]]

From this String ".Alpha.Beta.Zeta&Iota." I'd expect to get:

[[Alpha],[null]],

[[Beta],[null]],

[[Zeta],[Iota]]

//NumberOfPeriods tells me how many word "sections" are in the string
//Stor[] is an array that holds the string index locations of "."
for(int i=0;i<NumberOfPeriods;i++)
{
    int length = Stor[i];
    while(Line.charAt(length) != '"')
    {
        length++;
    }
    Example[i] = Line.substring(Stor[i], length);
}
//This code can get the words separated by "." but not by "&"

//Stor[] is an array that holds all string index locations of '.'
//AmpStor[] is an array that holds all string index locations of '&'
int TotalLength = Stor[0];
int InnerLength = 0;
int OuterLength = 0;
while(Line.charAt(TotalLength) != '"')
{
    while(Line.charAt(OuterLength)!='.')
    {
        while(Line.charAt(InnerLength)!='&')
        {
            InnerLength++;
        }
        if(Stor[i] > AmpStor[i])
        {
            Example[i][j] = Line.substring(Stor[i], InnerLength);
        }
        if(Stor[i] < AmpStor[i])
        {
            Example[i][j] = Line.substring(AmpStor[i],InnerLength);
        }
        OuterLength++;
    }
}
//Here I run into the issue of indexing into different parts of the array i & j

RgSW :

This is how I would solve your problem (it's completely different from your code but it works).

First of all, remove the quotes and the leading and trailing non-word characters. This can be done using replaceAll:

String Formatted = Line.replaceAll( "(^\"[.&]*)|([.&]*\"$)", "" );

The regular expression in the first argument will match the double quotes at both ends and the leading and trailing .s and &s. The method will return a new string where the matched characters are removed, because the second argument is an empty string (it replaces with an empty string).
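To see what that replaceAll call leaves behind, here is a minimal sketch. The class name TrimDemo and the helper strip are made up for illustration; only the regular expression comes from the line above.

```java
class TrimDemo {
    // Hypothetical helper: removes the surrounding quotes together with
    // any '.' or '&' characters directly inside them.
    static String strip(String line) {
        return line.replaceAll("(^\"[.&]*)|([.&]*\"$)", "");
    }

    public static void main(String[] args) {
        String line = "\".Word.Characters&Numeric&Letters.Typos&Mistypes.\"";
        // prints: Word.Characters&Numeric&Letters.Typos&Mistypes
        System.out.println(strip(line));
    }
}
```

Only the ends of the string are touched; the . and & characters between the words are kept for the split calls that follow.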

Now you can split this string at each . using the split method. The output array can only be defined after this call, because its first dimension is the number of groups that split returns:

String[] StringGroups = Formatted.split( "\\." );
String[][] Elements = new String[StringGroups.length][];

The point must be escaped with a backslash, which is itself written as \\ inside a Java string literal. This is needed because split takes a regular expression, and an unescaped . matches any non-newline character, so splitting on just "." would treat every character as a delimiter.
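The difference between the escaped and the unescaped point is easy to check. The sketch below is only an illustration (the class SplitDotDemo and the helper splitOnDots are invented names); it also shows Pattern.quote from java.util.regex as another way to split on a literal string.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDotDemo {
    // Hypothetical helper: Pattern.quote turns "." into a literal pattern,
    // so no manual escaping is needed.
    static String[] splitOnDots(String s) {
        return s.split(Pattern.quote("."));
    }

    public static void main(String[] args) {
        String s = "Alpha.Beta.Zeta&Iota";

        // Escaped point: splits on literal '.' characters.
        System.out.println(Arrays.toString(s.split("\\."))); // [Alpha, Beta, Zeta&Iota]

        // Unescaped point: every character is a delimiter, only empty
        // strings remain, and split drops trailing empty strings.
        System.out.println(s.split(".").length); // 0

        // Same result as the escaped version.
        System.out.println(Arrays.toString(splitOnDots(s)));
    }
}
```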

Now split each string in that array at each & using the same split method. Add the result directly to your Elements array:

// Loop over the array
int MaxLength = 0;
for( int i = 0; i < StringGroups.length; i ++ ) {
   String StrGroup = StringGroups[ i ];
   String[] Group = StrGroup.split( "&" );
   Elements[ i ] = Group;

   // Measure the max length
   if( Group.length > MaxLength ) {
       MaxLength = Group.length;
   }
}

No \\ is necessary here, since & has no special meaning in a regular expression and simply matches &-characters. At this point Elements is a jagged array: each row holds exactly as many words as its group contains. The MaxLength variable is only needed for padding the shorter rows with null values. If you don't want the padding, drop MaxLength and you're done here.
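If the null values are not needed, the result of the two split calls can be used as it is. A small sketch of that variant follows; the class JaggedDemo and the method parse are illustrative names, and the input is assumed to have had its quotes and outer delimiters removed already.

```java
import java.util.Arrays;

class JaggedDemo {
    // Hypothetical helper: splits an already trimmed line into groups ('.')
    // and each group into words ('&'). Rows may differ in length.
    static String[][] parse(String formatted) {
        String[] groups = formatted.split("\\.");
        String[][] elements = new String[groups.length][];
        for (int i = 0; i < groups.length; i++) {
            elements[i] = groups[i].split("&");
        }
        return elements;
    }

    public static void main(String[] args) {
        String[][] result = parse("Word.Characters&Numeric&Letters.Typos&Mistypes");
        // prints: [[Word], [Characters, Numeric, Letters], [Typos, Mistypes]]
        System.out.println(Arrays.deepToString(result));
    }
}
```

When iterating over such an array, use the length of each row (result[i].length) rather than one fixed column count.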

If you want the null values however, loop over your elements array and copy the current rows into new arrays:

for( int i = 0; i < Elements.length; i ++ ) {
    String[] Current = Elements[ i ];
    String[] New = new String[ MaxLength ];

    // Copy existing values into new array, extra values remain null
    System.arraycopy( Current, 0, New, 0, Current.length );
    Elements[ i ] = New;
}

Now, the Elements array contains exactly what you wanted.
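As a side note, the padding step could also be written with Arrays.copyOf, which returns a copy of the given length and fills the extra slots with null. This is just an alternative sketch, not a change to the approach; PadDemo and pad are invented names.

```java
import java.util.Arrays;

class PadDemo {
    // Hypothetical helper: returns a rectangular copy of a jagged array,
    // padding shorter rows with null.
    static String[][] pad(String[][] rows) {
        int max = 0;
        for (String[] row : rows) {
            max = Math.max(max, row.length);
        }
        String[][] padded = new String[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            // copyOf pads with null when the new length is larger
            padded[i] = Arrays.copyOf(rows[i], max);
        }
        return padded;
    }

    public static void main(String[] args) {
        String[][] jagged = { { "Alpha" }, { "Beta" }, { "Zeta", "Iota" } };
        // prints: [[Alpha, null], [Beta, null], [Zeta, Iota]]
        System.out.println(Arrays.deepToString(pad(jagged)));
    }
}
```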

Here is the complete executable code:

public class StringSplitterExample {
    public static void main( String[] args ) {
        test( "\".Word.Characters&Numeric&Letters.Typos&Mistypes.\"" );
        System.out.println(); // Line between
        test( "\".Alpha.Beta.Zeta&Iota.\"" );
    }

    public static void test( String Line ) {
        String Formatted = Line.replaceAll( "(^\"[.&]*)|([.&]*\"$)", "" );
        String[] StringGroups = Formatted.split( "\\." );
        String[][] Elements = new String[StringGroups.length][];

        // Loop over the array
        int MaxLength = 0;
        for( int i = 0; i < StringGroups.length; i ++ ) {
            String StrGroup = StringGroups[ i ];
            String[] Group = StrGroup.split( "&" );
            Elements[ i ] = Group;

            // Measure the max length
            if( Group.length > MaxLength ) {
                MaxLength = Group.length;
            }
        }

        for( int i = 0; i < Elements.length; i ++ ) {
            String[] Current = Elements[ i ];
            String[] New = new String[ MaxLength ];

            // Copy existing values into new array, extra values remain null
            System.arraycopy( Current, 0, New, 0, Current.length );
            Elements[ i ] = New;
        }

        for( String[] Group : Elements ) {
            for( String Word : Group ) {
                System.out.print( Word );
                System.out.print( " " );
            }
            System.out.println();
        }
    }
}

The output of this example:

Word null null 
Characters Numeric Letters 
Typos Mistypes null 

Alpha null 
Beta null 
Zeta Iota 

So this works, and you don't even need to know where the . and & characters are in your string. Java will just do that for you.
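If you would still rather avoid regular expressions, the same grouping can be done with a single pass over the characters. The following is only a sketch under a few assumptions: the names ManualScanDemo and parse are made up, the result is a list of lists instead of a String[][], and empty words (for example between two consecutive points) are skipped. I have not measured whether it is faster than split.

```java
import java.util.ArrayList;
import java.util.List;

class ManualScanDemo {
    // Hypothetical helper: '.' (or a quote) closes the current group,
    // '&' only closes the current word.
    static List<List<String>> parse(String line) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        StringBuilder word = new StringBuilder();

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '.' || c == '&' || c == '"') {
                // A delimiter ends the word collected so far, if any.
                if (word.length() > 0) {
                    current.add(word.toString());
                    word.setLength(0);
                }
                // Anything except '&' also ends the group, if it has words.
                if (c != '&' && !current.isEmpty()) {
                    groups.add(current);
                    current = new ArrayList<>();
                }
            } else {
                word.append(c);
            }
        }

        // Flush leftovers in case the line does not end with a delimiter.
        if (word.length() > 0) {
            current.add(word.toString());
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints: [[Alpha], [Beta], [Zeta, Iota]]
        System.out.println(parse("\".Alpha.Beta.Zeta&Iota.\""));
    }
}
```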

