Tuesday, April 3, 2018

Regular expression to select a particular content, provided it is not enclosed in comments

I am looking for a regular expression which matches the pattern src="*.js", but this should not be enclosed in a comment.

Consider the following:

<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
<!----<script type="text/javascript" src="js/Shop.js"></script>  -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>

Extended sample input, described by OP as "correct":

<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
<!----<script type="text/javascript" src="js/Shop.js"></script>  -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
-- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>

The result should not match lines 1 and 2 (where the content is enclosed in a comment). It should match only lines 3 and 4 (for the extended sample input: line 3 through the end, except the line that only closes a comment).

So far I have this regexp which selects all my .js files but also the ones that are commented out: (src=\")+(\S)+(.js)

I am looking for a regex which only selects the script tags with a .js src attribute that are not surrounded by a comment.

I would also like to mention that I am using this regular expression in an Oracle PL/SQL query.

6 Answers

Answers 1

I don't know if you can do what you want with a single regular expression, especially since Oracle's implementation of regular expressions does not support lookaround. But there are some things you can do with SQL to get around these limitations. The following extracts the matches in two steps: first it removes the comments from the text, then it matches the pattern src=".*\.js" in what remains. Multiple results are retrieved using CONNECT BY:

SELECT html_id, REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') AS match
  FROM (
    SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
      FROM (
        SELECT 1 AS html_id, '<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
        <!----<script type="text/javascript" src="js/Shop.js"></script>  -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
        -- afterwards -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script>
        <script type="text/javascript" src="jquery.cookie.js"></script>' AS html_text
          FROM dual
    )
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') IS NOT NULL
   AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;

If the HTML is stored in a table somewhere, then you would do the following:

SELECT html_id, REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') AS match
  FROM (
    SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
      FROM mytable
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') IS NOT NULL
   -- keep each hierarchy within one source row (assumes html_id is unique in mytable)
   AND PRIOR html_id = html_id
   AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;

It seems strange but the final line is necessary to avoid duplicate results.

Results as follows:

| HTML_ID | MATCH                              |
+---------+------------------------------------+
|       1 | src="jquery.serialize-object.js"   |
|       1 | src="jquery.serialize-object.js"   |
|       1 | src="jquery.serialize-object.js"   |
|       1 | src="jquery.serialize-object.js"   |
|       1 | src="jquery.cookie.js"             |
+---------+------------------------------------+

SQL Fiddle HERE.

Hope this helps.

EDIT: Edited according to my comment below. The pattern now uses [^"]* instead of .* so that a match cannot run past the closing quote:

SELECT html_id, REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') AS match
  FROM (
    SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
      FROM (
        SELECT 1 AS html_id, '<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
        <!----<script type="text/javascript" src="js/Shop.js"></script>  -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
        -- afterwards -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script>
        <script type="text/javascript" src="jquery.cookie.js"></script>' AS html_text
          FROM dual
    )
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') IS NOT NULL
   AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;
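
The two-step idea in this answer (remove the comments first, then match what remains) is not tied to Oracle. Below is a minimal sketch in Python, assuming Python's re module as a stand-in for REGEXP_REPLACE and REGEXP_SUBSTR. The names COMMENT, SRC_JS and uncommented_js_sources are made up for illustration, and the two regex flavors are not identical in every detail, so treat it as a way to experiment rather than as a translation of the query.

```python
import re

# Extended sample input from the question, one string with embedded newlines.
SAMPLE = "\n".join([
    '<!------<script type="text/javascript" src="js/Shop.js"></script>  -->',
    '<!----<script type="text/javascript" src="js/Shop.js"></script>  -->',
    '<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->',
    '<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->',
    '<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending',
    '-- afterwards -->',
    '<script type="text/javascript" src="jquery.serialize-object.js"></script>',
    '<script type="text/javascript" src="jquery.cookie.js"></script>',
])

# Step 1 pattern: a non-greedy comment; DOTALL lets "." cross line breaks,
# which plays the role of the 'n' flag in the Oracle query.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Step 2 pattern: a src attribute ending in .js, case-insensitive ('i' flag).
SRC_JS = re.compile(r'src="[^"]*\.js"', re.IGNORECASE)


def uncommented_js_sources(html):
    """Strip HTML comments, then return every src="...js" attribute left."""
    clean_html = COMMENT.sub("", html)
    return SRC_JS.findall(clean_html)


matches = uncommented_js_sources(SAMPLE)
for match in matches:
    print(match)
```

On the sample this prints four jquery.serialize-object.js attributes and one jquery.cookie.js attribute, which agrees with the result table above.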

Answers 2

For example, given this sample input:

<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
<!----<script type="text/javascript" src="js/Shop.js"></script>  -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
-- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>

This regex: src="[^"]*\.js\"></script>(\s*<!--[^>]*-->)*(\s*<!--[^>]*)?$
will give you this output:

<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>

I tested with GNU grep 2.5.4, hoping that it gets close enough to your regex flavor. The regex is very light on special features.

Explanation:

  • \"[^"]* is "anything within " "
  • (<!--[^>]*-->)* is "any number of complete comments, if they do not contain > "
  • (<!--[^>]*)?$ is "an optional start of a non-> comment at the end of a line"
  • \s* allowing optional white space
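
The line-by-line behavior described above can be reproduced without grep. Here is a minimal sketch in Python, assuming the re module is close enough to the grep flavor for this pattern; the helper name grep_like and the constant names are hypothetical. It keeps each line on which the regex finds a match, which is what grep does by default.

```python
import re

# Extended sample input, one entry per line.
LINES = [
    '<!------<script type="text/javascript" src="js/Shop.js"></script>  -->',
    '<!----<script type="text/javascript" src="js/Shop.js"></script>  -->',
    '<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->',
    '<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->',
    '<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending',
    '-- afterwards -->',
    '<script type="text/javascript" src="jquery.serialize-object.js"></script>',
    '<script type="text/javascript" src="jquery.cookie.js"></script>',
]
SAMPLE = "\n".join(LINES)

# The regex from this answer: a .js script tag, followed only by complete
# comments and/or one comment that starts but does not end on the line.
PATTERN = re.compile(
    r'src="[^"]*\.js"></script>(\s*<!--[^>]*-->)*(\s*<!--[^>]*)?$'
)


def grep_like(text):
    """Return the lines of text on which PATTERN matches somewhere."""
    return [line for line in text.splitlines() if PATTERN.search(line)]


matched = grep_like(SAMPLE)
for line in matched:
    print(line)
```

Lines 1 and 2 are rejected because the text left after </script> is a bare comment end, which none of the optional groups can consume before the end of the line.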

Note that beyond a certain complexity of the relevant input, regexes stop being the right tool. Past that point, a dedicated tool, i.e. a parser for XML/HTML or whatever the format is, is the better choice.
For me that point is reached once the relevant input can be "hidden" inside a multiline comment. I feel that you turned the question into a moving target: you first confirmed that the relevant input can be expected on a single line (apart from a comment starting afterwards), but then changed the rules by adding contradicting sample input. At one point you did describe the sample input I proposed as "correct".
The (very funny) XML/regex Q&A linked in the comments demonstrates the hell you can end up in if you do not draw the line early enough.
When you are restricted to a given environment, e.g. an SQL server, leverage the special abilities of that environment. It is surely possible to process the non-commented parts of the input with SQL mechanisms and so reach a goal a few steps further on. In other words, drop your immediate idea of how to proceed and take a little detour in your thinking. Try to make sure that you do not exhaust yourself on an XY problem.

Answers 3

I've put a negative look-ahead before the end of your regex, but mind that if there's a commented part after the src it will likewise be ignored.

(src=\")+(\S)+(\.js\")+(?!.*-->)(.*) 
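
The caveat about trailing comments can be checked quickly outside Oracle. The sketch below uses Python's re module, which supports negative lookahead, and is meant only as an experiment with the pattern above; the names LOOKAHEAD and matching_lines are made up for illustration.

```python
import re

# Extended sample input from the question, one entry per line.
LINES = [
    '<!------<script type="text/javascript" src="js/Shop.js"></script>  -->',
    '<!----<script type="text/javascript" src="js/Shop.js"></script>  -->',
    '<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->',
    '<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->',
    '<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending',
    '-- afterwards -->',
    '<script type="text/javascript" src="jquery.serialize-object.js"></script>',
    '<script type="text/javascript" src="jquery.cookie.js"></script>',
]

# The lookahead pattern from this answer: reject the match when a comment
# end (-->) appears anywhere after the .js" on the same line.
LOOKAHEAD = re.compile(r'(src=\")+(\S)+(\.js\")+(?!.*-->)(.*)')


def matching_lines(lines):
    """Return the lines on which the lookahead pattern matches."""
    return [line for line in lines if LOOKAHEAD.search(line)]


matched = matching_lines(LINES)
for line in matched:
    print(line)
```

The commented lines 1 and 2 are skipped as intended, but so are lines 3 and 4, whose script tags are not commented out and are merely followed by a complete comment. That is the limitation described above.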

Edit:

I managed something similar without the lookahead (which PL/SQL doesn't have):

(src=\")(\S)+(\.js\")[^(--)\n]+(\n|$) 

Answers 4

Here is my solution : one simple negative lookbehind.

(?<!<!--.+)src=".+\.js"

This matches all the src attributes in your extended example, but not those preceded by <!--. It might just be enough, tell me if I missed some specific cases ;)

Here is my solution running on your extended example : https://regex101.com/r/rmHkbm/1

EDIT: This works in JavaScript; I don't know whether it works in Oracle PL/SQL. Is there any way to test it without installing an Oracle database?

Answers 5

I don't think it's possible to do what you want using a single regular expression without negative lookaround. But, you can do it by logically combining two similar regular expressions in a way that's easy to do in SQL. The basic idea is:

[MATCH_EXPR] AND NOT [COMMENTED_MATCH_EXPR] 

Assuming we have a table data with a column line (lines of code), we could select the lines of interest with something like:

SELECT line
  FROM data
 WHERE REGEXP_LIKE(line, 'src="[^"]+\.js"')
   AND NOT REGEXP_LIKE(line, '<!--.*src="[^"]+\.js"');

You can update the regular expressions to be more precise and/or do something more sophisticated with them, e.g. capture the file names, but the approach would be the same.

This approach is not bulletproof in that it would fail to find lines that consist of two <script> statements where only the second one is commented out, since the second regular expression would match. Nevertheless, it would likely work for the vast majority of typical code, including the examples given above.
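
Both the basic idea and the limitation just described can be tried out with a small sketch. It is written in Python under the assumption that the re module behaves like REGEXP_LIKE for these two simple patterns; the function name is_uncommented and the a.js / b.js example line are hypothetical.

```python
import re

# [MATCH_EXPR]: the line contains a .js src attribute.
MATCH_EXPR = re.compile(r'src="[^"]+\.js"')

# [COMMENTED_MATCH_EXPR]: a comment start precedes such an attribute.
COMMENTED_MATCH_EXPR = re.compile(r'<!--.*src="[^"]+\.js"')


def is_uncommented(line):
    """Evaluate MATCH_EXPR AND NOT COMMENTED_MATCH_EXPR for one line."""
    return bool(MATCH_EXPR.search(line)) and not COMMENTED_MATCH_EXPR.search(line)


commented = '<!----<script type="text/javascript" src="js/Shop.js"></script>  -->'
plain = '<script type="text/javascript" src="jquery.cookie.js"></script>'
trailing = ('<script type="text/javascript" src="jquery.serialize-object.js">'
            '</script><!-- a comment -- afterwards -->')

kept = [line for line in (commented, plain, trailing) if is_uncommented(line)]
print(kept)

# The known weak spot: two script tags on one line, only the second one
# commented out. The whole line is rejected, so a.js is missed.
mixed = '<script src="a.js"></script> <!-- <script src="b.js"></script> -->'
print(is_uncommented(mixed))
```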

Answers 6

I tried the query below on https://livesql.oracle.com, so it will probably work for you, assuming that an uncommented line starts with '<script'. It matches lines like:

    <script type="text/javascript" src="jquery.cookie.js"></script>
    <script type="text/javascript" src="jquery.serialize-object.js"/>
    <script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->

Query with regular expressions:

select "SRC"
  from "TABLE_1"
 where REGEXP_LIKE (SRC, '^\<script.+\.js.+script\>$')
    or REGEXP_LIKE (SRC, '^\<script.+\.js.+script\>\<\!\-\-.+\-\-\>$')
    or REGEXP_LIKE (SRC, '^\<script.+\.js.+\/\>$');
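
To see which lines the three anchored patterns accept, here is a minimal sketch in Python. It assumes Python's re module as a stand-in for REGEXP_LIKE, and it drops the backslashes in front of <, >, !, - and / because those characters need no escaping in Python; the names ANCHORED_PATTERNS and is_script_line are made up for illustration.

```python
import re

# The three alternatives from the query, each anchored at both ends.
ANCHORED_PATTERNS = [
    re.compile(r'^<script.+\.js.+script>$'),          # plain closing tag
    re.compile(r'^<script.+\.js.+script><!--.+-->$'),  # closing tag + comment
    re.compile(r'^<script.+\.js.+/>$'),               # self-closing tag
]


def is_script_line(line):
    """True if any of the three anchored patterns matches the whole line."""
    return any(pattern.search(line) for pattern in ANCHORED_PATTERNS)


accepted = [
    '<script type="text/javascript" src="jquery.cookie.js"></script>',
    '<script type="text/javascript" src="jquery.serialize-object.js"/>',
    '<script type="text/javascript" src="jquery.serialize-object.js">'
    '</script><!-- a comment -- afterwards -->',
]
rejected = '<!----<script type="text/javascript" src="js/Shop.js"></script>  -->'

for line in accepted + [rejected]:
    print(is_script_line(line), line)
```

The commented-out line is rejected only because it does not start with <script, so the approach relies on the assumption stated at the top of this answer.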