Wednesday, December 28, 2011

The Java way of highlighting HTML content with JSoup and regex (works for Android too)

When I first wanted to do this, I searched Stack Overflow and other Q&A forums, but in vain. There are very few places where this topic has been discussed, and none of them really reached a conclusion. So here's a solution I put together, with parts of it borrowed from various places.

What I wanted to do was: fetch HTML from a web page into my Java application; get a search string from the application user; build a regex out of the search string; and highlight the HTML content that matched the regex. This is quite easy to do with JavaScript, but my use case required it to be done in the Java code.

Basically, I needed this piece of markup wrapped around the text that should be highlighted.


<span style="background-color:yellow"> Text that matched regex </span> 


Building the regex from search string

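Suppose my search string was 'hello world'; then my regex would be ((\b)hello(\b)|(\b)world[a-zA-Z0-9]*?(\b)). Every word must match as a whole word, and the last word may also carry a trailing run of letters or digits, so 'world' also matches 'worlds'.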

Here's the function that will do it.

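private static String buildRegexFromQuery(String queryString) {
        /* Clean up query: punctuation becomes a space, whitespace runs collapse.
           Note "+" rather than "*": a "*" pattern also matches the empty string
           and would insert a space between every character. */
        String queryToConvert = queryString.replaceAll("\\p{Punct}+", " ");
        queryToConvert = queryToConvert.replaceAll("\\s+", " ").trim();

        /* Split the cleaned query (not the raw one) into search terms */
        String[] regexArray = queryToConvert.split(" ");

        /* Every term except the last must match as a whole word */
        String regex = "(";
        for (int i = 0; i < regexArray.length - 1; i++) {
            regex += "(\\b)" + regexArray[i] + "(\\b)|";
        }

        /* The last term may carry a letter/digit suffix ("world" matches "worlds") */
        regex += "(\\b)" + regexArray[regexArray.length - 1] + "[a-zA-Z0-9]*?(\\b))";
        return regex;
    }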
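
To see what such a regex actually matches, here is a small standalone sketch. The class name QueryRegexDemo and the helper findMatches are made up for this illustration and are not part of the Highlighter class. It assumes the query is already free of punctuation and applies the pattern case-insensitively, as the final class does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, only meant to illustrate the regex shape.
class QueryRegexDemo {

    // Builds the same kind of regex as buildRegexFromQuery:
    // whole-word alternatives, the last one with an optional suffix.
    // Assumption: the query contains only letters, digits and spaces.
    static String buildRegex(String query) {
        String[] terms = query.trim().split("\\s+");
        StringBuilder regex = new StringBuilder("(");
        for (int i = 0; i < terms.length - 1; i++) {
            regex.append("(\\b)").append(terms[i]).append("(\\b)|");
        }
        regex.append("(\\b)").append(terms[terms.length - 1])
             .append("[a-zA-Z0-9]*?(\\b))");
        return regex.toString();
    }

    // Returns every substring of text matched by the regex, ignoring case.
    static List<String> findMatches(String regex, String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String regex = buildRegex("hello world");
        System.out.println(regex);
        System.out.println(findMatches(regex, "Hello, worlds of hello-world and Othello"));
    }
}
```

With the sample sentence above, the matcher should pick up Hello, worlds, hello and world, but not the 'hello' buried inside 'Othello', because of the word-boundary checks.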

Searching and replacing proper html content

I cannot just do a simple String.replaceAll(), because that would mess up the whole thing whenever the matched text is inside a tag. So I was pretty sure I needed an HTML parser. Then I met the beautiful JSoup.

JSoup made things quite easy for me. I just had to traverse the HTML DOM tree, retrieve the text and make changes. With the NodeTraversor and NodeVisitor classes, traversing the HTML content is a piece of cake.

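    public String getHighlightedHtml() {

        Document doc = Jsoup.parse(htmlContent);

        // Collect matching text nodes first; they are replaced only after the
        // traversal has finished, so the tree is not modified while it is walked.
        final List<TextNode> nodesToChange = new ArrayList<TextNode>();

        NodeTraversor nd = new NodeTraversor(new NodeVisitor() {

            @Override
            public void tail(Node node, int depth) {
                // Only text nodes are inspected, so tags and attributes are never touched
                if (node instanceof TextNode) {
                    TextNode textNode = (TextNode) node;
                    String text = textNode.getWholeText();

                    mat = pat.matcher(text);

                    if (mat.find()) {
                        nodesToChange.add(textNode);
                    }
                }
            }

            @Override
            public void head(Node node, int depth) {
                // Nothing to do when entering a node
            }
        });

        nd.traverse(doc.body());

        // Swap every matching text node for a node carrying the highlight markup
        for (TextNode textNode : nodesToChange) {
            Node newNode = buildElementForText(textNode);
            textNode.replaceWith(newNode);
        }
        return doc.toString();
    }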

Wrapping it all...

Here's my final class to wrap things up -

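import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class Highlighter {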
    
   private String regex;
     private String htmlContent;
     Pattern pat;
     Matcher mat;
    
    
     public Highlighter(String searchString, String htmlString) {
      regex = buildRegexFromQuery(searchString);
      htmlContent = htmlString;
      pat = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
     }
    
     public String getHighlightedHtml() {
    
      Document doc = Jsoup.parse(htmlContent);
    
      final List<TextNode> nodesToChange = new ArrayList<TextNode>();
    
      NodeTraversor nd  = new NodeTraversor(new NodeVisitor() {
    
       @Override
       public void tail(Node node, int depth) {
        if (node instanceof TextNode) {
         TextNode textNode = (TextNode) node;
         String text = textNode.getWholeText();
         
         mat = pat.matcher(text);
         
         if(mat.find()) {
          nodesToChange.add(textNode);
         }
        }
       }
    
       @Override
       public void head(Node node, int depth) {        
       }
      });
    
      nd.traverse(doc.body());
    
      for (TextNode textNode : nodesToChange) {
       Node newNode = buildElementForText(textNode);
       textNode.replaceWith(newNode);
      }
      return doc.toString();
     }
    
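     private static String buildRegexFromQuery(String queryString) {
      /* Clean up query: punctuation becomes a space, whitespace runs collapse.
         Note "+" rather than "*": a "*" pattern also matches the empty string
         and would insert a space between every character. */
      String queryToConvert = queryString.replaceAll("\\p{Punct}+", " ");
      queryToConvert = queryToConvert.replaceAll("\\s+", " ").trim();

      /* Split the cleaned query (not the raw one) into search terms */
      String[] regexArray = queryToConvert.split(" ");

      /* Every term except the last must match as a whole word */
      String regex = "(";
      for(int i = 0; i < regexArray.length - 1; i++) {
       regex += "(\\b)" + regexArray[i] + "(\\b)|";
      }

      /* The last term may carry a letter/digit suffix ("world" matches "worlds") */
      regex += "(\\b)" + regexArray[regexArray.length - 1] + "[a-zA-Z0-9]*?(\\b))";
      return regex;
     }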
    
     private Node buildElementForText(TextNode textNode) {
      String text = textNode.getWholeText().trim();
      
      ArrayList<MatchedWord> matchedWordSet = new ArrayList<MatchedWord>();
      
      mat = pat.matcher(text);
      
      while(mat.find()) {
       matchedWordSet.add(new MatchedWord(mat.start(), mat.end()));
      }
      
      StringBuffer newText = new StringBuffer(text);
    
      for(int i = matchedWordSet.size() - 1; i >= 0; i-- ) {
       String wordToReplace = newText.substring(matchedWordSet.get(i).start, matchedWordSet.get(i).end);
       wordToReplace = "<span style=\"background-color:yellow\">" + wordToReplace + "</span>";
       newText = newText.replace(matchedWordSet.get(i).start, matchedWordSet.get(i).end, wordToReplace);  
      }
      return new DataNode(newText.toString(), textNode.baseUri());
     }
     
     class MatchedWord {
      public int start;
      public int end;
      
      public MatchedWord(int start, int end) {
       this.start = start;
       this.end = end;
      }
     }
    }
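
One thing to watch with the DataNode trick above: as far as I can tell, a DataNode is written out as raw text, so any '<' or '&' that was in the original text node ends up unescaped in the output. Below is a minimal sketch of how the replacement string could be built with escaping. The class HighlightMarkupDemo and its methods escapeHtml and wrapMatches are hypothetical names for this illustration, it uses only java.util.regex, and it escapes just the three characters &, < and >.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Highlighter class above.
class HighlightMarkupDemo {

    // Escapes the characters that would otherwise be read as markup.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Walks the matches left to right, escaping both the untouched text
    // and the matched text, and wraps each match in the highlight span.
    static String wrapMatches(String text, Pattern pattern) {
        StringBuilder out = new StringBuilder();
        Matcher m = pattern.matcher(text);
        int last = 0;
        while (m.find()) {
            if (m.start() == m.end()) {
                continue; // ignore empty matches
            }
            out.append(escapeHtml(text.substring(last, m.start())));
            out.append("<span style=\"background-color:yellow\">")
               .append(escapeHtml(m.group()))
               .append("</span>");
            last = m.end();
        }
        out.append(escapeHtml(text.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern pat = Pattern.compile("(\\b)jsoup(\\b)", Pattern.CASE_INSENSITIVE);
        System.out.println(wrapMatches("a < b & JSoup rocks", pat));
    }
}
```

The string returned by wrapMatches could then be handed to the DataNode in place of the unescaped text. Building the output in one left-to-right pass also avoids having to patch the buffer from the last match backwards.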


Don't forget to add the JSoup library to your build path; it can be downloaded from the JSoup website.

Tuesday, December 20, 2011

What did people google in the last year?

The Zeitgeist 2011 report has been released, and the top 10s are all out there making a buzz on Twitter and blogs. For those who don't know what a Zeitgeist report is, it is a report that Google releases at the end of every year to show the year's top searches in different categories.

Rebecca Black tops the list of fastest-rising global queries and, amazingly, Google+ has taken the second position. Had Google not been pushing Google+ onto its users, it probably would not have made it to the top, but anyway, here it is. Steve Jobs is at position nine, and most of the searches for him came only after he died. But with iPhone5 (the biggest disappointment of 2011) at position 6 and iPad2 at position 10, Apple did rock the top 10 this year.

MySpace has topped the list of 'Fastest falling' search queries and, surprisingly, Orkut is nowhere in the top 10, which quite amazes me. Maybe it is just not in the league anymore.

Coming to the consumer electronics section, Kindle Fire happens to be the most searched gadget of the year. With all the hype Amazon has created, it definitely deserved to be there, but on a related note: the Kindle Fire is nowhere near the iPad's league. iPhone4S is in second position.

Now, looking at the top 10s for India, Facebook is in first place and G+ is 3rd. Looks like Indians were busy socializing this year. Three movies in the top 10 fastest-rising searches is quite a lot: Bodyguard at number 5, Ra.One at number 6 and Ready at number 10. Don't Indians have anything else to search for? Or am I just being paranoid?

This year's World Cup is at number 4. Poonam Pandey, one of the reasons why a part of the Indian population wanted India to win the World Cup, is at number 9. India won the World Cup, and people still wonder where Poonam Pandey is.

Anna Hazare takes the lead among the fastest-rising people. With all the friction going on in India, he is definitely the most searched-for person. Poonam Pandey is again second, and I can understand why. Third comes Steve Jobs, and the rest of the list is populated with celebrities like Salman Khan, Anushka Sharma, Kajal Aggarwal, Vijay Mallya and others.

Katrina Kaif is the most searched-for person in the 'People' section. Sachin Tendulkar takes 7th position. Bodyguard is the most searched movie, followed by Ra.One and Harry Potter in 2nd and 3rd position.

So this is the Zeitgeist 2011 list, and you can find the complete list on the official website.

Friday, December 2, 2011

Get the new Google Bar

Google recently went on a renovation spree to change the appearance of all its services. Ever since the release of G+, Google has been rolling out a new look to every other service it has.



Recently Google announced on its official blog that it is ready for its next set of changes: the new Google bar, which makes it easier to access different Google services. Although the new Google bar has been announced, it has not yet been rolled out to users, but here is a simple tweak that helps you get a head start in trying it out.

The tweak is to modify your cookie. All you have to do is follow these simple steps -

For Google Chrome users -

1. Add the 'Edit this Cookie' extension to Chrome
2. Go to google.com
3. Click on the extension
4. Go to the PREF section and change the value field to this:
ID=03fd476a699d6487:U=88e8716486ff1e5d:FF=0:LD=en:CR=2:TM=1322688084:LM=1322688085:S=McEsyvcXKMiVfGds
5. Refresh the page


For Firefox users -


1. Go to google.com
2. Press Ctrl + Shift + K to open the Web Console
3. Paste the following into the text field and press Enter
document.cookie="PREF=ID=03fd476a699d6487:U=88e8716486ff1e5d:FF=0:LD=en:CR=2:TM=1322688084:LM=1322688085:S=McEsyvcXKMiVfGds; path=/; domain=.google.com";window.location.reload();

For IE users -

Install Chrome, Firefox or anything that evolved after the dinosaurs!

Thursday, September 1, 2011

Ubuntu Transformation Pack for Windows XP, Vista and 7

A transformation pack allows you to change the look and style of your operating system. It changes the look and feel of Windows without installing a new operating system. Some packs also add extra functionality modelled on the operating system they imitate.



There are a number of ways to transform the Windows look into Ubuntu's, viz. using a theme or using a transformation pack. Here are some links where one can find themes as well as transformation packs for Windows XP, Vista and 7.



Here are the links for downloading Ubuntu themes:

Download Ubuntu transformation pack for Windows 7

http://fc01.deviantart.net/fs70/f/2011/122/e/3/ubuntu_skin_pack_4_0_for_win_7_by_hameddanger-d3ff3od.zip

http://www.softpedia.com/get/System/OS-Enhancements/Ubuntu-Skin-Pack.shtml

Download Ubuntu transformation pack for Windows XP


http://www.winmatrix.com/forums/index.php?/topic/24453-ubuntu-transformation-pack-for-windows-xp/

http://techpp.com/2009/02/13/ultimate-collection-ubuntu-themes-visual-styles-for-windows-xp-vista/

Download Ubuntu transformation pack for Windows Vista

http://hydrattz.deviantart.com/art/Ubuntu-Transformation-Pack-88978320



Precautions to take before using Transformation Packs

  1. Try to use a transformation pack that only changes the look and feel rather than one that introduces new features, as irreversible changes may be made to your system files, causing the system not to work properly.
  2. Read all the instructions carefully, as these steps may introduce unwanted changes to your computer.


If care is taken, one can enjoy a different look and feel every other month without harming one's system.