Using C# regex to count the number of words between two tags
Written by:
4/12/2012 4:39 PM
I am currently extracting messages that must contain certain keywords, and sometimes the number of matches retrieved is too large. The only ways to reduce it are either to refine the keywords (HI% could become HIGH, for instance) or to limit the number of words allowed between two keywords.
As usual, most of the work lies in finding the regular expression that gives the right results. The tool used here is the C# (.NET) Regex class.
All patterns were first tested with the free Regular Expression Designer, provided by RAD Software here: http://www.radsoftware.com.au/regexdesigner.
In order to get the words between two tags, the following pattern should be used:
key0(.*?)key1
Here key0 and key1 are the keywords used to get the extractions. Depending on the type of keyword, we have to add '\b' (a word boundary) before and/or after the keyword.
For example, LO% will be transformed into \bLO and EUR will be transformed into \bEUR\b.
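The keyword-to-pattern rule can be sketched as a small helper. This is only a minimal sketch, not the code used for the extractions: the name KeywordPatterns.ToPattern is hypothetical, and it assumes that % is a wildcard that only appears at the start or end of a keyword and that the keyword itself begins and ends with a letter or digit (otherwise \b would not anchor as intended). Regex.Escape is added so that characters such as '.' or '+' inside a keyword are matched literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordPatterns
{
    // Hypothetical helper: turns a keyword such as "LO%" or "EUR" into a regex fragment.
    // A '%' at either end is assumed to be a wildcard, so no \b is added on that side.
    public static string ToPattern(string keyword)
    {
        bool openStart = keyword.StartsWith("%", StringComparison.Ordinal);
        bool openEnd = keyword.EndsWith("%", StringComparison.Ordinal);
        string core = Regex.Escape(keyword.Trim('%'));
        return (openStart ? "" : @"\b") + core + (openEnd ? "" : @"\b");
    }

    public static void Main()
    {
        Console.WriteLine(ToPattern("LO%"));  // \bLO
        Console.WriteLine(ToPattern("EUR"));  // \bEUR\b

        // Build key0(.*?)key1 and capture the text between the two keywords.
        string pattern = ToPattern("HIGH") + "(.*?)" + ToPattern("EUR");
        Match m = Regex.Match("HIGH demand for EUR today", pattern, RegexOptions.IgnoreCase);
        Console.WriteLine("[" + m.Groups[1].Value + "]");  // [ demand for ]
    }
}
```

The lazy quantifier in (.*?) makes the group stop at the first key1 that follows key0, so the captured text is the part between the two keywords, surrounding spaces included.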
If the keywords can be spread over several lines, the Regex object has to be created with the Singleline option, which makes '.' match line breaks as well, so that the input text is treated as if it were a single line.
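The effect of the Singleline option can be checked with a short test. This is an illustrative sketch: the class name SinglelineDemo, the helper KeysFound and the sample text are made up for the example, while Regex.IsMatch and RegexOptions are the standard .NET API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Returns true when both keywords are found, in that order, with the given options.
    public static bool KeysFound(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, @"\bHIGH\b(.*?)\bEUR\b", options);
    }

    public static void Main()
    {
        string input = "HIGH demand\nfor EUR";  // the keywords sit on two different lines

        // Without Singleline, '.' stops at the line break, so nothing matches.
        Console.WriteLine(KeysFound(input, RegexOptions.None));        // False

        // With Singleline, '.' also matches '\n'.
        Console.WriteLine(KeysFound(input, RegexOptions.Singleline));  // True
    }
}
```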
Once the text between the keywords has been retrieved, we have to count the number of words it contains, which can be done like this: System.Text.RegularExpressions.Regex.Matches(input, @"[\S]+").Count;
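To see what this expression actually counts, here is a small sketch. The class name WordCounter and the method CountTokens are hypothetical; \S+ is equivalent to [\S]+ and matches every run of non-whitespace characters, so punctuation and numbers standing on their own are counted as words too.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Counts the runs of non-whitespace characters in the text.
    public static int CountTokens(string text)
    {
        return Regex.Matches(text, @"\S+").Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountTokens(" demand for "));     // 2
        Console.WriteLine(CountTokens(""));                 // 0
        Console.WriteLine(CountTokens("up 3.5% - again"));  // 4 ("-" counts as a word)
    }
}
```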
In the end, the C# code looks like this:
    using System;
    using System.Text.RegularExpressions;

    // Both methods are meant to live inside a class.

    // Assumed helper (not part of the original listing): wraps the
    // word-counting expression given earlier.
    static int CountWords(string input)
    {
        return Regex.Matches(input, @"[\S]+").Count;
    }

    // Returns the number of words between key0 and key1 for the first match
    // that has at most min words, trying key0...key1 first, then key1...key0.
    // Returns -1 if the two keys never occur together; a value greater than
    // min means they were found, but never within min words of each other.
    static int CountMinimumWordsBetweenKeys(string input, string key0, string key1, int min)
    {
        int _result = -1;
        string _pattern = key0 + "(.*?)" + key1;
        Regex _regex = new Regex(_pattern,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        MatchCollection _matches = _regex.Matches(input);
        foreach (Match _m in _matches)
        {
            // Group 1 holds the text captured between the two keys.
            string _value = String.Empty;
            if (_m.Groups.Count > 1)
                _value = _m.Groups[1].Value;
            _result = CountWords(_value);
            if (_result != -1 && _result <= min)
                break;
        }
        if (_result == -1 || _result > min) // try the other order
        {
            _pattern = key1 + "(.*?)" + key0;
            _regex = new Regex(_pattern,
                RegexOptions.Singleline | RegexOptions.IgnoreCase);
            _matches = _regex.Matches(input);
            foreach (Match _m in _matches)
            {
                string _value = String.Empty;
                if (_m.Groups.Count > 1)
                    _value = _m.Groups[1].Value;
                _result = CountWords(_value);
                if (_result != -1 && _result <= min)
                    break;
            }
        }
        return _result;
    }
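For reference, here is how the method might be called. This is a self-contained sketch under a few assumptions: the class name KeywordDistance and the sample messages are invented, CountWords is assumed to wrap the word-counting expression shown earlier, and the method body is a condensed variant that loops over the two keyword orders instead of repeating the code, with the same behaviour as the listing.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordDistance
{
    // Assumed helper: counts runs of non-whitespace characters.
    static int CountWords(string text)
    {
        return Regex.Matches(text, @"\S+").Count;
    }

    // Condensed variant of the listing: the same two passes, written as a loop.
    public static int CountMinimumWordsBetweenKeys(string input, string key0, string key1, int min)
    {
        int result = -1;
        string[] patterns = { key0 + "(.*?)" + key1, key1 + "(.*?)" + key0 };
        foreach (string pattern in patterns)
        {
            MatchCollection matches = Regex.Matches(input, pattern,
                RegexOptions.Singleline | RegexOptions.IgnoreCase);
            foreach (Match m in matches)
            {
                result = CountWords(m.Groups[1].Value);
                if (result <= min)
                    return result;  // close enough, stop searching
            }
        }
        return result;  // -1 if never found, otherwise the last distance seen
    }

    public static void Main()
    {
        string high = @"\bHIGH\b";
        string eur = @"\bEUR\b";

        // Keys in the given order, two words apart.
        Console.WriteLine(CountMinimumWordsBetweenKeys("Expect HIGH demand for EUR today", high, eur, 3));  // 2
        // Keys in the reverse order: found by the second pass.
        Console.WriteLine(CountMinimumWordsBetweenKeys("EUR demand is HIGH", high, eur, 3));                // 2
        // Keys absent.
        Console.WriteLine(CountMinimumWordsBetweenKeys("nothing here", high, eur, 3));                      // -1
        // Keys present but too far apart: the caller has to compare the result with min.
        Console.WriteLine(CountMinimumWordsBetweenKeys("HIGH a b c d e EUR", high, eur, 3));                // 5
    }
}
```

A message should therefore be kept only when the returned value is neither -1 nor greater than min.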