ITQuants blog

Using C# regex to count the number of words between two tags

Apr 12

4/12/2012 4:39 PM

When extracting messages that should contain certain keywords, the number of occurrences retrieved is sometimes too large. There are two ways to narrow the results: either refine the keywords (HI% could become HIGH, for instance) or limit the number of words allowed between two keywords.

As usual, most of the work lies in finding the regular expression that gives the right results. The tool used here is the C# (.NET) Regex class.

All tests were first done using the free Regular Expression Designer, provided by Rad Software here: http://www.radsoftware.com.au/regexdesigner.

In order to get the words between two tags, the following pattern should be used:

key0(.*?)key1

where key0 and key1 are the keywords used for the extraction. Depending on the type of keyword, '\b' (a word boundary) has to be added before and/or after the keyword.

For example, LO% (any word starting with LO) is transformed into \bLO, and EUR (a whole word) is transformed into \bEUR\b.
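The transformation described above could be sketched as a small helper. This is only an illustration: the class name KeywordPatterns and the method ToPattern are hypothetical and not part of the original code, and the sketch assumes that '%' appears only at the start or the end of a keyword.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordPatterns
{
    // Hypothetical helper: turns a keyword written with '%' wildcards into a
    // regex fragment, adding \b on each side that has no wildcard.
    public static string ToPattern(string keyword)
    {
        bool openStart = keyword.StartsWith("%", StringComparison.Ordinal);
        bool openEnd = keyword.EndsWith("%", StringComparison.Ordinal);

        // Escape the literal part so characters such as '.' or '+' are matched as-is.
        string literal = Regex.Escape(keyword.Trim('%'));

        return (openStart ? "" : @"\b") + literal + (openEnd ? "" : @"\b");
    }

    public static void Main()
    {
        Console.WriteLine(ToPattern("LO%")); // \bLO
        Console.WriteLine(ToPattern("EUR")); // \bEUR\b
    }
}
```

Note that \b only behaves as expected when the keyword starts and ends with a word character (letter, digit or underscore).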

If the keywords are spread over several lines, the Regex object has to be created with the Singleline option. With this option '.' also matches newline characters, so the input text is treated as if it were on a single line.
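A minimal sketch of the effect of this option, using a made-up two-line input (the class SinglelineDemo and its method are illustrative names only):

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Returns true when both keywords are found, in this order, with the given options.
    public static bool FindsAcrossLines(RegexOptions options)
    {
        string input = "EUR rates\nstay LOW"; // made-up sample: keywords on two lines
        return Regex.IsMatch(input, @"\bEUR\b(.*?)\bLOW\b", options);
    }

    public static void Main()
    {
        // By default '.' stops at '\n', so the two keywords are never joined.
        Console.WriteLine(FindsAcrossLines(RegexOptions.None));       // False
        // With Singleline, '.' also matches '\n'.
        Console.WriteLine(FindsAcrossLines(RegexOptions.Singleline)); // True
    }
}
```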

Once the text between the two keywords has been retrieved, its words have to be counted, which can be done like this: System.Text.RegularExpressions.Regex.Matches(input, @"[\S]+").Count;
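Wrapped in a method, the counting step might look like the sketch below. The post does not show the body of its CountWords helper, so this is an assumption about what it does; the class name WordCounter is illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Assumed implementation: counts runs of non-whitespace characters.
    // \S+ is equivalent to the [\S]+ pattern used in the text.
    public static int CountWords(string text)
    {
        return Regex.Matches(text, @"\S+").Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountWords(" has joined the ")); // 3
        Console.WriteLine(CountWords(""));                 // 0
    }
}
```

With this definition, anything separated by whitespace counts as a word, so an isolated punctuation mark such as "-" is counted too.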

In the end, the C# code looks like this:

using System.Text.RegularExpressions; // at the top of the file

// Returns the smallest number of words found between key0 and key1, in either
// order, stopping as soon as a count <= min is found.
// Returns -1 when the two keys never occur together.
// key0 and key1 are regex fragments (for instance \bEUR\b), so they are not escaped.
static int CountMinimumWordsBetweenKeys(string input, string key0, string key1, int min)
{
    int _result = CountInOrder(input, key0, key1, min);
    if (_result == -1 || _result > min) // try the other order
    {
        int _reversed = CountInOrder(input, key1, key0, min);
        if (_reversed != -1 && (_result == -1 || _reversed < _result))
            _result = _reversed;
    }
    return _result;
}

// Scans every "first ... second" match and keeps the smallest word count.
static int CountInOrder(string input, string first, string second, int min)
{
    int _best = -1;
    Regex _regex = new Regex(first + "(.*?)" + second,
        RegexOptions.Singleline | RegexOptions.IgnoreCase);
    foreach (Match _m in _regex.Matches(input))
    {
        // CountWords wraps the Regex.Matches(text, @"[\S]+").Count expression described above.
        int _count = CountWords(_m.Groups[1].Value);
        if (_best == -1 || _count < _best)
            _best = _count;
        if (_best <= min) // close enough, no need to look further
            break;
    }
    return _best;
}


Categories: C#

1 comment(s) so far...



Re: Using C# regex to count the number of words between two tags

The same thing can be done with a single regex:
\bJOHN\W+((?:\w+\W+)){1,3}?room\b
which should work with "John has joined the room" (with the IgnoreCase option).
{1,3} means "between 1 and 3 words" between "JOHN" and "room".
Source: www.regular-expressions.info/near.html

By Mathias Kluba on   9/13/2012 4:46 PM
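The single-regex approach from this comment can be tried with a short sketch. The class NearSearch and the method AreNear are hypothetical names, and the upper bound on the number of words is passed as a parameter. The sample sentence has three words between the two keywords, so the bound has to be at least 3 for it to match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NearSearch
{
    // Hypothetical helper: true when 'second' follows 'first' with at most
    // maxWords words in between. Both keywords are treated as literal whole words.
    public static bool AreNear(string input, string first, string second, int maxWords)
    {
        string pattern = @"\b" + Regex.Escape(first)
                       + @"\W+(?:\w+\W+){0," + maxWords + @"}?"
                       + Regex.Escape(second) + @"\b";
        return Regex.IsMatch(input, pattern, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string sentence = "John has joined the room";
        Console.WriteLine(AreNear(sentence, "JOHN", "room", 3)); // True
        Console.WriteLine(AreNear(sentence, "JOHN", "room", 2)); // False
    }
}
```

Unlike the two-step version in the post, this variant only checks one order of the keywords and defines a word as a run of \w characters.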
