When I first needed to do this, I searched Stack Overflow and other Q&A forums, but in vain. The topic is discussed in very few places, and none of those discussions reached a conclusion. So here is a solution I put together, with parts of it borrowed from various places.
What I wanted to do was: fetch HTML from a web page into my Java application; get a search string from the application user; build a regex out of the search string; and highlight the HTML content that matched the regex. This is quite easy to do in JavaScript, but my use case required it to be done in Java code.
Basically, I needed to wrap this piece of markup around the text that should be highlighted.
<span style="background-color:yellow"> Text that matched regex </span>
Building the regex from search string
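Suppose my search string is 'hello world'; then my regex would be ((\b)hello(\b)|(\b)world[a-zA-Z0-9]*?(\b)). Every word must match as a whole word, except the last one, which also matches longer words that start with it, so 'world' also matches 'worlds'.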
Here's the function that will do it.
private static String buildRegexFromQuery(String queryString) {
    // Clean up the query: turn punctuation into spaces, then collapse whitespace.
    // The quantifier is "+" because "*" also matches the empty string, which
    // would make replaceAll insert a space between every character.
    String cleaned = queryString.replaceAll("\\p{Punct}+", " ")
                                .replaceAll("\\s+", " ")
                                .trim();
    String[] terms = cleaned.split(" ");

    StringBuilder regex = new StringBuilder("(");
    // Every term except the last must match as a whole word.
    for (int i = 0; i < terms.length - 1; i++) {
        regex.append("(\\b)").append(terms[i]).append("(\\b)|");
    }
    // The last term may be followed by more letters or digits.
    regex.append("(\\b)").append(terms[terms.length - 1]).append("[a-zA-Z0-9]*?(\\b))");
    return regex.toString();
}
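To see what such a regex matches, here is a small self-contained sketch. It uses a variant of the builder inside a hypothetical `QueryRegexDemo` class; the class name, the `findMatches` helper, and the sample sentence are my own assumptions for illustration and are not part of the final class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it only needs java.util.regex.
class QueryRegexDemo {

    // Builds an alternation in which every term must match as a whole word,
    // except the last one, which may be followed by more letters or digits.
    static String buildRegexFromQuery(String query) {
        String cleaned = query.replaceAll("\\p{Punct}+", " ").trim();
        String[] terms = cleaned.split("\\s+");
        StringBuilder regex = new StringBuilder("(");
        for (int i = 0; i < terms.length - 1; i++) {
            regex.append("(\\b)").append(terms[i]).append("(\\b)|");
        }
        regex.append("(\\b)").append(terms[terms.length - 1]).append("[a-zA-Z0-9]*?(\\b))");
        return regex.toString();
    }

    // Collects every substring of text that the regex matches, ignoring case.
    static List<String> findMatches(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String regex = buildRegexFromQuery("hello world");
        System.out.println(regex);
        System.out.println(findMatches(regex, "Hello there, worldwide worlds! Othello says hello."));
    }
}
```

Running it prints the regex ((\b)hello(\b)|(\b)world[a-zA-Z0-9]*?(\b)) followed by [Hello, worldwide, worlds, hello]. The last term matches as a prefix, while "Othello" is skipped because of the word boundary in front of "hello".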
Searching and replacing the right HTML content
I cannot just do a simple String.replaceAll(), because that would corrupt the markup whenever the matched text appears inside a tag, for example in an attribute value. So I was pretty sure I needed an HTML parser. Then I met the beautiful JSoup.
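To make the problem with a plain replace concrete, here is a minimal sketch. The `NaiveReplaceDemo` class and the sample anchor tag are hypothetical and serve only to show what goes wrong when the raw HTML string is searched directly.

```java
import java.util.regex.Pattern;

// Hypothetical demo: a regex replace over raw HTML also rewrites attribute values.
class NaiveReplaceDemo {

    // Wraps every case-insensitive occurrence of word, including those inside tags.
    static String naiveHighlight(String html, String word) {
        return html.replaceAll("(?i)" + Pattern.quote(word),
                "<span style=\"background-color:yellow\">$0</span>");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/hello\" title=\"hello\">hello</a>";
        System.out.println(naiveHighlight(html, "hello"));
    }
}
```

All three occurrences of "hello" get wrapped, so the href and title attributes end up containing span markup and the link is broken. Only the third occurrence, the visible link text, should have been touched.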
JSoup made things quite easy for me. I just had to traverse the HTML DOM tree, retrieve the text nodes and change the ones that match. With the NodeTraversor class and the NodeVisitor interface, traversing the HTML content is a piece of cake.
public String getHighlightedHtml() {
    Document doc = Jsoup.parse(htmlContent);
    final List<TextNode> nodesToChange = new ArrayList<TextNode>();

    // Collect the matching text nodes first and replace them afterwards,
    // so the tree is not modified while it is being traversed.
    // Recent jsoup releases expose traverse as a static method; older ones
    // used new NodeTraversor(visitor).traverse(root) instead.
    NodeTraversor.traverse(new NodeVisitor() {
        @Override
        public void head(Node node, int depth) {
            // Nothing to do when entering a node.
        }

        @Override
        public void tail(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                if (pat.matcher(textNode.getWholeText()).find()) {
                    nodesToChange.add(textNode);
                }
            }
        }
    }, doc.body());

    for (TextNode textNode : nodesToChange) {
        textNode.replaceWith(buildElementForText(textNode));
    }
    return doc.toString();
}
Wrapping it all...
Here's my final class, which puts everything together:
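import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class Highlighter {

    private static final String HIGHLIGHT_OPEN = "<span style=\"background-color:yellow\">";
    private static final String HIGHLIGHT_CLOSE = "</span>";

    private final String htmlContent;
    private final Pattern pat;

    public Highlighter(String searchString, String htmlString) {
        htmlContent = htmlString;
        pat = Pattern.compile(buildRegexFromQuery(searchString), Pattern.CASE_INSENSITIVE);
    }

    public String getHighlightedHtml() {
        Document doc = Jsoup.parse(htmlContent);
        final List<TextNode> nodesToChange = new ArrayList<TextNode>();

        // Collect the matching text nodes first and replace them afterwards,
        // so the tree is not modified while it is being traversed.
        // Recent jsoup releases expose traverse as a static method; older ones
        // used new NodeTraversor(visitor).traverse(root) instead.
        NodeTraversor.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Nothing to do when entering a node.
            }

            @Override
            public void tail(Node node, int depth) {
                if (node instanceof TextNode) {
                    TextNode textNode = (TextNode) node;
                    if (pat.matcher(textNode.getWholeText()).find()) {
                        nodesToChange.add(textNode);
                    }
                }
            }
        }, doc.body());

        for (TextNode textNode : nodesToChange) {
            textNode.replaceWith(buildElementForText(textNode));
        }
        return doc.toString();
    }

    private static String buildRegexFromQuery(String queryString) {
        // Clean up the query: turn punctuation into spaces, then collapse whitespace.
        // The quantifier is "+" because "*" also matches the empty string.
        String cleaned = queryString.replaceAll("\\p{Punct}+", " ")
                                    .replaceAll("\\s+", " ")
                                    .trim();
        String[] terms = cleaned.split(" ");

        StringBuilder regex = new StringBuilder("(");
        // Every term except the last must match as a whole word.
        for (int i = 0; i < terms.length - 1; i++) {
            regex.append("(\\b)").append(terms[i]).append("(\\b)|");
        }
        // The last term may be followed by more letters or digits.
        regex.append("(\\b)").append(terms[terms.length - 1]).append("[a-zA-Z0-9]*?(\\b))");
        return regex.toString();
    }

    private Node buildElementForText(TextNode textNode) {
        // Keep the text untrimmed so spacing next to neighbouring elements survives.
        String text = textNode.getWholeText();
        List<MatchedWord> matchedWords = new ArrayList<MatchedWord>();
        Matcher mat = pat.matcher(text);
        while (mat.find()) {
            matchedWords.add(new MatchedWord(mat.start(), mat.end()));
        }

        // Wrap the matches from last to first so the earlier offsets stay valid.
        StringBuilder newText = new StringBuilder(text);
        for (int i = matchedWords.size() - 1; i >= 0; i--) {
            MatchedWord word = matchedWords.get(i);
            newText.insert(word.end, HIGHLIGHT_CLOSE);
            newText.insert(word.start, HIGHLIGHT_OPEN);
        }

        // A DataNode is written out verbatim, which lets the span markup through.
        // The surrounding text is not re-escaped, so a "<" or "&" in it is emitted as-is.
        // Older jsoup releases take a second baseUri argument: new DataNode(data, baseUri).
        return new DataNode(newText.toString());
    }

    // Start (inclusive) and end (exclusive) offsets of one match inside a text node.
    private static class MatchedWord {
        final int start;
        final int end;

        MatchedWord(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }
}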
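One thing to watch with the DataNode approach: a DataNode is written out verbatim, so any < or & in the original text reaches the output unescaped. Below is a minimal sketch of one way around that, assuming a single left-to-right pass that escapes the plain text and wraps each match. `TextHighlightDemo`, `escape` and `highlight` are hypothetical names of my own, and the escaping covers text content only, not attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns the plain text of one text node into highlighted HTML.
class TextHighlightDemo {

    static final String OPEN = "<span style=\"background-color:yellow\">";
    static final String CLOSE = "</span>";

    // Minimal escaping for text content; "&" must be replaced first.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Escapes the text between matches and wraps each match in the highlight span.
    static String highlight(String text, Pattern pattern) {
        StringBuilder out = new StringBuilder();
        Matcher m = pattern.matcher(text);
        int last = 0;
        while (m.find()) {
            out.append(escape(text.substring(last, m.start())));
            out.append(OPEN).append(escape(m.group())).append(CLOSE);
            last = m.end();
        }
        out.append(escape(text.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\bhello\\b", Pattern.CASE_INSENSITIVE);
        System.out.println(highlight("1 < 2 & Hello", p));
    }
}
```

The string returned by highlight could be handed to the DataNode in place of the unescaped text. For the sample input it yields 1 &lt; 2 &amp; followed by the word Hello wrapped in the yellow span.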
Don't forget to add the JSoup library to your build path. Download JSoup from here.