Monday, June 25, 2018

Regex to get MTOM binary content


I am trying to get the MTOM binary content using an extended class of SoapClient. The response looks something like this:

    --uuid:8c73f23e-47d9-49fb-a61c-c1df7b19a306+id=2
    Content-ID: <http://tempuri.org/0>
    Content-Transfer-Encoding: 8bit
    Content-Type: application/xop+xml;charset=utf-8;type="text/xml"

    <big-xml-here>
        <xop:Include href="cid:http://tempuri.org/1/636644204289948690" xmlns:xop="http://www.w3.org/2004/08/xop/include"/>
    </big-xml-here>
    --uuid:8c73f23e-47d9-49fb-a61c-c1df7b19a306+id=2--

Right after the XML, the MTOM response continues with the binary parts referenced by the "cid" URL:

    Content-ID: <http://tempuri.org/1/636644204289948690>
    Content-Transfer-Encoding: binary
    Content-Type: application/octet-stream

    %PDF-1.4 %���� (lots of binary content here)
    --uuid:7329cfb8-46a4-40a8-b15b-39b7b0988b57+id=4--

To extract everything I've tried this code:

    $xop_elements = null;
    preg_match_all('/<xop[\s\S]*?\/>/', $response, $xop_elements);
    $xop_elements = reset($xop_elements);

    if (is_array($xop_elements) && count($xop_elements)) {
        foreach ($xop_elements as $xop_element) {
            $cid = null;
            preg_match('/cid:(.*?)"/', $xop_element, $cid);

            if (isset($cid[1])) {
                $cid = $cid[1];
                $binary = null;
                preg_match("/Content-ID:.*?$cid.*?(.*?)uuid/", $response, $binary);
                var_dump($binary);
                exit();
            }
        }
    }

Although the preg_match_all and the first preg_match are working, the last one:

/Content-ID:.*?$cid.*?(.*?)uuid/  

is not working

On the original source: https://github.com/debuss/MTOMSoapClient/blob/master/MTOMSoapClient.php

the regex is

/Content-ID:[\s\S].+?'.$cid.'[\s\S].+?>([\s\S]*?)--uuid/ 

but I got an error on PHP 7:

preg_match(): Unknown modifier '/'

Is there a way to get the MTOM binary for each CID?

Thanks in advance!

2 Answers

Answer 1

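First, you need to escape $cid with preg_quote(). The CID is a URL, and its slashes end the "/"-delimited pattern early, which is what causes the "Unknown modifier '/'" error: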

$cid = preg_quote($cid[1], '/'); 

Next, you need to add the s modifier so that . also matches newlines:

preg_match("/Content-ID:.*?$cid.*?(.*?)uuid/s", $response, $binary); 

s (PCRE_DOTALL) If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
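The two fixes can be tried together in a minimal sketch like the one below. The `$response` string is a made-up, shortened stand-in for the real MTOM payload, and the pattern is anchored on `<...>` around the CID (an adjustment of mine, not part of the code above) so that the match starts at the part header rather than at the `href` inside the XML. Note that the capture still contains the remaining part headers in front of the binary data.

```php
<?php
// Hypothetical, shortened MTOM response used only for illustration.
$response = "--uuid:aaa+id=1\r\n"
    . "Content-ID: <http://tempuri.org/0>\r\n"
    . "Content-Type: application/xop+xml\r\n\r\n"
    . "<xop:Include href=\"cid:http://tempuri.org/1/123\"/>\r\n"
    . "--uuid:aaa+id=1\r\n"
    . "Content-ID: <http://tempuri.org/1/123>\r\n"
    . "Content-Transfer-Encoding: binary\r\n"
    . "Content-Type: application/octet-stream\r\n\r\n"
    . "%PDF-1.4 fake binary\r\n"
    . "--uuid:aaa+id=1--\r\n";

$cid = 'http://tempuri.org/1/123';

// Escape regex metacharacters and the "/" delimiter inside the CID.
$quoted = preg_quote($cid, '/');

// The "s" flag lets "." span the line breaks between headers and body.
$pattern = '/Content-ID:\s*<' . $quoted . '>(.*?)--uuid/s';
$matched = preg_match($pattern, $response, $binary);

var_dump($matched);
var_dump(trim($binary[1]));
```

Without preg_quote(), the first unescaped "/" of the URL would be read as the closing delimiter of the pattern.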

Answer 2

As I understand it, you are trying to adapt the original code to the format of your SOAP response.

Instead of a number, you want to capture the whole http://tempuri.org/1/636644204289948690 in the $cid variable (you may want to rename the var). To do so you could use the following regex, that matches everything but a double quote in capture group 1: cid:([^"]+)

preg_match('/cid:([^"]+)/', $xop_element, $cid); 

So far, so good. Guessing from your description you should use the following pattern to capture the binary part:

'%Content-ID: <'.$cid.'>([\s\S]*?)--uuid%' 

We use [\s\S] in place of the dot to match across multiple lines (as the original implementation does). Alternatively, keep the dot and add the s (single-line) flag or the (?s) inline modifier. I also use % as the regex delimiter to avoid having to escape the slashes. It is still a good idea to use preg_quote($cid[1], '%'), as suggested by Tarun.


Now, you can retrieve the block in question from capture group 1:

trim($binary[1]); 
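For completeness, here is a rough sketch of how the pieces might fit together for every CID in a response. The function name `extractMtomBinaries` and the sample `$response` are hypothetical. The sketch also assumes that the headers of each part end at the first blank line and that every part is followed by a `--uuid` boundary, as in the sample from the question, so it skips the remaining headers instead of returning them with the binary. I have not tested it against large attachments.

```php
<?php
/**
 * Hypothetical helper: maps each xop:Include CID to the body of its MIME part.
 * Assumes part headers end at the first blank line and that each part is
 * terminated by a "--uuid" boundary.
 */
function extractMtomBinaries(string $response): array
{
    $binaries = [];

    // Collect every CID referenced by an <xop:Include href="cid:..."> element.
    preg_match_all('/<xop:Include[^>]*href="cid:([^"]+)"/', $response, $includes);

    foreach ($includes[1] as $cid) {
        // Escape the CID for use inside a "%"-delimited pattern.
        $quoted = preg_quote($cid, '%');

        // Skip the remaining headers (up to the blank line), then capture
        // the body lazily until the line break before the next boundary.
        $pattern = '%Content-ID:\s*<' . $quoted . '>.*?\r?\n\r?\n(.*?)\r?\n--uuid%s';

        if (preg_match($pattern, $response, $match)) {
            $binaries[$cid] = $match[1];
        }
    }

    return $binaries;
}

// Made-up, shortened response used only to exercise the helper.
$response = "--uuid:aaa+id=1\r\n"
    . "Content-ID: <http://tempuri.org/0>\r\n"
    . "Content-Type: application/xop+xml\r\n\r\n"
    . "<xop:Include href=\"cid:http://tempuri.org/1/123\"/>\r\n"
    . "--uuid:aaa+id=1\r\n"
    . "Content-ID: <http://tempuri.org/1/123>\r\n"
    . "Content-Transfer-Encoding: binary\r\n"
    . "Content-Type: application/octet-stream\r\n\r\n"
    . "%PDF-1.4 fake binary\r\n"
    . "--uuid:aaa+id=1--\r\n";

$result = extractMtomBinaries($response);
var_dump($result);
```

Keying the result by CID makes it straightforward to substitute each binary back into the matching xop:Include element later.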