Detect HTML links

    Java 15:

    import java.util.Scanner;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Solution {
        public static void main(String[] args) {
            Scanner scanner = new Scanner(System.in);
            // First line: number of HTML lines that follow.
            int n = Integer.parseInt(scanner.nextLine().trim());
            // group(1) = URL, group(2) = link text (explained in the comment below).
            String regex = "<a href\\s*=\\s*\"([^\"]+)\"[^>]*>(?:<[^<>]+>)*([^><]*)(?:<[^<>]+>)*";
            Pattern pattern = Pattern.compile(regex);
            for (int i = 0; i < n; i++) {
                Matcher matcher = pattern.matcher(scanner.nextLine());
                // A single line may contain several links, so keep searching.
                while (matcher.find()) {
                    System.out.println(matcher.group(1).trim() + "," + matcher.group(2).trim());
                }
            }
        }
    }
    
    /*
    regex = "<a href\\s*=\\s*\"([^\"]+)\"[^>]*>(?:<[^<>]+>)*([^><]*)(?:<[^<>]+>)*"
    
    > The literal "<a " (with the space) is important: <area> also has an href attribute, and this prefix keeps it from matching.
    > \\s* means zero or more whitespace characters.
    > \" is an escaped double quote inside the Java string literal.
    > \"([^\"]+)\" means double quote + (one or more characters other than a double quote) + double quote.
    > The parenthesized part forms group(1), i.e. the required URL.
    > [^>]*> means (zero or more characters other than >) + >
        The opening <a ...> tag may carry other attributes after href, such as title="---", before it closes. This portion skips over them.
         
    > (?:___) is a non-capturing group: it groups without being numbered, so it does not become group(2).
    > (?:<[^<>]+>) is a non-capturing group matching < + (one or more characters other than < and >) + >, i.e. a single tag like <h1>.
    > (?:<[^<>]+>)* skips any opening tags nested inside the <a> tag, like <h1><p>.
    > ([^><]*) matches any run of characters other than < and >. It forms group(2), i.e. the required link text.
    > (?:<[^<>]+>)* is used again to skip the closing tags that follow the text, like </p></h1>.
    
    > So (?:<[^<>]+>)*([^><]*)(?:<[^<>]+>)*
        means (any opening tags like <h1>) + (link text) + (any closing tags like </h1>)
    */
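
To see the two groups in action without reading from stdin, here is a small sketch that runs the same regex on a few hard-coded lines. The class name LinkDemo, the helper extractLinks, and the sample HTML strings are made up for illustration and are not part of the submitted solution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkDemo {
    // Same pattern as in the solution above.
    private static final Pattern LINK = Pattern.compile(
        "<a href\\s*=\\s*\"([^\"]+)\"[^>]*>(?:<[^<>]+>)*([^><]*)(?:<[^<>]+>)*");

    // Hypothetical helper: returns one "url,text" entry per link found in the line.
    static List<String> extractLinks(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(line);
        while (m.find()) {
            out.add(m.group(1).trim() + "," + m.group(2).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample inputs invented for this demo.
        String[] lines = {
            // Extra attribute after href, and a nested <b> tag around the text.
            "<li><a href=\"/wiki/Main_Page\" title=\"Visit the main page\"><b>Main Page</b></a></li>",
            // Two links on one line.
            "<a href=\"a.html\">A</a> and <a href=\"b.html\">B</a>",
            // <area> has an href too, but must not be reported.
            "<area href=\"map.html\" shape=\"rect\">"
        };
        for (String line : lines) {
            System.out.println(extractLinks(line));
        }
    }
}
```

The first line should yield /wiki/Main_Page,Main Page, the second line yields two entries (a.html,A and b.html,B), and the <area> line yields nothing because "<a " must be followed directly by href.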