Wednesday, December 28, 2011

The Java way of highlighting HTML content with JSoup and regex (works on Android too)

When I first wanted to do this, I searched Stack Overflow and other Q&A forums, but in vain. Very few places discuss this topic, and none of them really reach a conclusion. So here is the solution I put together, with parts of it borrowed from various places.

What I wanted to do was: fetch HTML from a web page into my Java application; get a search string from the application user; build a regex out of the search string; and highlight the HTML content that matches the regex. This is quite easy to do with JavaScript, but my use case required it to be done in Java code.

Basically, I needed to wrap the text to be highlighted in this markup:


<span style="background-color:yellow"> Text that matched regex </span> 


Building the regex from search string

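Suppose my search string is 'hello world'; then my regex would be ((\b)hello(\b)|(\b)world[a-zA-Z0-9]*?(\b)). Every word has to match as a whole word, except the last one, which also matches as the beginning of a longer word (so 'world' matches 'worlds' too).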

Here's the function that will do it.

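private static String buildRegexFromQuery(String queryString) {
    /* Clean up query: punctuation becomes a space, then runs of whitespace collapse.
       The quantifier must be + and not *, because * also matches the empty string
       and would insert a space between every pair of characters. */
    String queryToConvert = queryString.replaceAll("\\p{Punct}+", " ");
    queryToConvert = queryToConvert.replaceAll("\\s+", " ").trim();

    if (queryToConvert.isEmpty()) {
        return "(?!)"; /* nothing to search for: this regex never matches */
    }

    /* Split the cleaned query, not the raw one */
    String[] regexArray = queryToConvert.split(" ");

    StringBuilder regex = new StringBuilder("(");
    for (int i = 0; i < regexArray.length - 1; i++) {
        regex.append("(\\b)").append(regexArray[i]).append("(\\b)|");
    }

    /* The last word may be followed by more letters or digits */
    regex.append("(\\b)").append(regexArray[regexArray.length - 1]).append("[a-zA-Z0-9]*?(\\b))");
    return regex.toString();
}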
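
To see what a regex of this shape actually matches, here is a small sketch. It hard-codes the regex for 'hello world' instead of calling the function above, and the class and method names (QueryRegexDemo, findMatches) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QueryRegexDemo {

    // Returns every substring of text matched by regex, ignoring case.
    static List<String> findMatches(String regex, String text) {
        List<String> matches = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The regex for the query "hello world", written as a Java string literal.
        String regex = "((\\b)hello(\\b)|(\\b)world[a-zA-Z0-9]*?(\\b))";

        // "helloworld" is skipped: "hello" needs a word boundary after it,
        // and "world" needs one before it.
        System.out.println(findMatches(regex, "Hello worlds, helloworld and World!"));
        // prints [Hello, worlds, World]
    }
}
```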

Searching and replacing the right HTML content

I cannot just do a simple String.replaceAll(), because that would corrupt the markup whenever the matched text appears inside a tag. So I was pretty sure I needed an HTML parser. Then I met the beautiful JSoup.
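
To make the problem concrete, here is a minimal sketch of the naive approach applied to a link whose href happens to contain the search word. The class and method names (NaiveReplaceDemo, naiveHighlight) are hypothetical, and the <b> wrapper is just the simplest possible marker.

```java
import java.util.regex.Pattern;

class NaiveReplaceDemo {

    // Wraps every whole-word occurrence of word in <b>...</b>,
    // treating the HTML as if it were plain text.
    static String naiveHighlight(String html, String word) {
        String regex = "(?i)\\b" + Pattern.quote(word) + "\\b";
        return html.replaceAll(regex, "<b>$0</b>");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/hello\">hello</a>";
        // The word inside the href attribute gets wrapped too, which breaks the tag.
        System.out.println(naiveHighlight(html, "hello"));
        // prints <a href="/<b>hello</b>"><b>hello</b></a>
    }
}
```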

JSoup made things quite easy for me. I just had to traverse the HTML DOM tree, retrieve the text nodes, and make changes. With the NodeTraversor and NodeVisitor classes, traversing the HTML content is a piece of cake.

public String getHighlightedHtml() {

    Document doc = Jsoup.parse(htmlContent);

    /* Collect the matching text nodes first; the tree is modified
       only after the traversal has finished */
    final List<TextNode> nodesToChange = new ArrayList<TextNode>();

    /* Recent jsoup releases expose traverse() as a static method; the releases
       available in 2011 used new NodeTraversor(visitor).traverse(root) instead */
    NodeTraversor.traverse(new NodeVisitor() {

        @Override
        public void head(Node node, int depth) {
        }

        @Override
        public void tail(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                if (pat.matcher(textNode.getWholeText()).find()) {
                    nodesToChange.add(textNode);
                }
            }
        }
    }, doc.body());

    for (TextNode textNode : nodesToChange) {
        Node newNode = buildElementForText(textNode);
        textNode.replaceWith(newNode);
    }
    return doc.toString();
}
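
The method above hands each matching text node to buildElementForText. The core of that step can be sketched on a plain string, with no JSoup involved: walk the matches from left to right, copy the text between them, and wrap each match. This is only an illustration under my own assumptions; the names (TextHighlightDemo, highlight, escapeHtml) are hypothetical, and the hand-rolled escaping stands in for what an HTML library would normally do.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextHighlightDemo {

    // Escapes the characters that are significant in HTML text content.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Copies unmatched text (escaped) and wraps every match in a highlight span.
    static String highlight(String text, Pattern pattern) {
        StringBuilder out = new StringBuilder();
        Matcher matcher = pattern.matcher(text);
        int last = 0; // end of the previous match
        while (matcher.find()) {
            out.append(escapeHtml(text.substring(last, matcher.start())));
            out.append("<span style=\"background-color:yellow\">")
               .append(escapeHtml(matcher.group()))
               .append("</span>");
            last = matcher.end();
        }
        out.append(escapeHtml(text.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("\\bhello\\b", Pattern.CASE_INSENSITIVE);
        System.out.println(highlight("a < hello", pattern));
        // prints a &lt; <span style="background-color:yellow">hello</span>
    }
}
```

Leading and trailing whitespace is kept as it is, so words in neighbouring inline elements do not run together after the replacement.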

Wrapping it all...

Here's my final class to wrap things up:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Tag;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class Highlighter {

    private final String htmlContent;
    private final Pattern pat;

    public Highlighter(String searchString, String htmlString) {
        htmlContent = htmlString;
        pat = Pattern.compile(buildRegexFromQuery(searchString), Pattern.CASE_INSENSITIVE);
    }

    public String getHighlightedHtml() {

        Document doc = Jsoup.parse(htmlContent);

        /* Collect the matching text nodes first; the tree is modified
           only after the traversal has finished */
        final List<TextNode> nodesToChange = new ArrayList<TextNode>();

        /* Recent jsoup releases expose traverse() as a static method; the releases
           available in 2011 used new NodeTraversor(visitor).traverse(root) instead */
        NodeTraversor.traverse(new NodeVisitor() {

            @Override
            public void head(Node node, int depth) {
            }

            @Override
            public void tail(Node node, int depth) {
                if (node instanceof TextNode) {
                    TextNode textNode = (TextNode) node;
                    if (pat.matcher(textNode.getWholeText()).find()) {
                        nodesToChange.add(textNode);
                    }
                }
            }
        }, doc.body());

        for (TextNode textNode : nodesToChange) {
            Node newNode = buildElementForText(textNode);
            textNode.replaceWith(newNode);
        }
        return doc.toString();
    }

    private static String buildRegexFromQuery(String queryString) {
        /* Clean up query: punctuation becomes a space, then runs of whitespace collapse */
        String queryToConvert = queryString.replaceAll("\\p{Punct}+", " ");
        queryToConvert = queryToConvert.replaceAll("\\s+", " ").trim();

        if (queryToConvert.isEmpty()) {
            return "(?!)"; /* nothing to search for: this regex never matches */
        }

        /* Split the cleaned query, not the raw one */
        String[] regexArray = queryToConvert.split(" ");

        StringBuilder regex = new StringBuilder("(");
        for (int i = 0; i < regexArray.length - 1; i++) {
            regex.append("(\\b)").append(regexArray[i]).append("(\\b)|");
        }

        /* The last word may be followed by more letters or digits */
        regex.append("(\\b)").append(regexArray[regexArray.length - 1]).append("[a-zA-Z0-9]*?(\\b))");
        return regex.toString();
    }

    /* Rebuilds one text node as a wrapper <span> whose children alternate between
       plain text and highlighted matches. The text is not trimmed, so the spacing
       between neighbouring elements survives, and JSoup escapes it on output. */
    private Node buildElementForText(TextNode textNode) {
        String text = textNode.getWholeText();
        Element wrapper = new Element(Tag.valueOf("span"), textNode.baseUri());

        Matcher mat = pat.matcher(text);
        int last = 0; /* end of the previous match */
        while (mat.find()) {
            wrapper.appendText(text.substring(last, mat.start()));
            wrapper.appendElement("span")
                   .attr("style", "background-color:yellow")
                   .text(mat.group());
            last = mat.end();
        }
        wrapper.appendText(text.substring(last));
        return wrapper;
    }
}


Don't forget to add the JSoup library to your build path. JSoup can be downloaded from jsoup.org.
