Using mod_sed to filter web content in Apache

mod_sed is an Apache module that filters web content using powerful sed commands, whether that content is generated by PHP, JSP, or plain HTML. Basic configuration information can be found in the README. In this blog post, I will cover how cryptic but powerful sed commands can be used inside Apache.

Using branches "b" to implement if/else type of code
Suppose I want to write
if (line contains "a") then
   replace "x" with "y"
else
   replace "y" with "x"
fi
If I want to write the above logic using "goto" syntax, I can write something like this (pseudocode):
if (line contains "a") go to :ifpart
# else part
   replace "y" with "x"
   go to :end
:ifpart
   replace "x" with "y"
:end
In sed we can use the branch command "b", which is the equivalent of goto. Here is the equivalent sed code (saved as one.sed):
/a/ b ifpart
s/y/x/g
b end
:ifpart
s/x/y/g
:end

$ cat one.txt
ax
xyz
$ /usr/ucb/sed -f one.sed < one.txt
ay
xxz
We can write the same example in Apache:
OutputSed "/a/ b ifpart"
OutputSed "s/y/x/g"
OutputSed "b end"
OutputSed ":ifpart"
OutputSed "s/x/y/g"
OutputSed ":end"
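
The same if/else can also be written without labels by negating the address with "!". This works here because replacing "x" with "y" never removes the "a" that the address tests for. Below is a minimal shell sketch that runs both forms side by side, assuming a POSIX-style sed on the PATH; the function names branch_version and negate_version are only illustrative and are not part of mod_sed.

```shell
#!/bin/sh
# Branch-based version, identical to the sed script above.
branch_version() {
    sed -e '/a/ b ifpart
s/y/x/g
b end
:ifpart
s/x/y/g
:end'
}

# Label-free version: apply one substitution to lines matching /a/
# and the other substitution to lines that do not match.
negate_version() {
    sed -e '/a/ s/x/y/g' -e '/a/! s/y/x/g'
}

# Both calls print "ay" followed by "xxz".
printf 'ax\nxyz\n' | branch_version
printf 'ax\nxyz\n' | negate_version
```

Branches remain the more general tool when the "if" part changes the text that the condition tests.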


Using hold buffer "h" as a buffer to save current text
Let's say I have this text:
It is Sunday today.
And I want to replace it with two lines:
It is Monday today.
It is Sunday today.
So I want to do the following (pseudocode):
saveline = curline
replace Sunday with Monday in curline
curline = curline + saveline
print curline
In sed, we will write something like :
# Save the current line in the hold buffer.
h
s/Sunday/Monday/
# Append the hold buffer to current text.
G
Sed's G command appends the hold buffer to the current line (pattern space), separated by a newline. Inside Apache, we can do the same thing using OutputSed directives:
OutputSed "h"
OutputSed "s/Sunday/Monday/"
OutputSed "G"
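
To check the h/G sequence outside Apache, the same three commands can be run through the command-line sed. This is a minimal sketch, assuming a POSIX-style sed on the PATH; hold_demo is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# h copies the pattern space to the hold buffer, s edits the pattern
# space, and G appends a newline plus the hold buffer to the pattern space.
hold_demo() {
    sed -e 'h' -e 's/Sunday/Monday/' -e 'G'
}

# Prints "It is Monday today." and then "It is Sunday today."
printf 'It is Sunday today.\n' | hold_demo
```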


Multiline expression using hold buffer and commands "N", "x", "h" and "H"
Sed is very powerful at multi-line text manipulation. Suppose I have a condition that says:
'If a line contains "Sunday" and the next line contains "Monday", then replace "Sunday" with "Monday" in the first line and replace "Monday" with "Tuesday" in the second line.'
As an example, I have this text:
It is Sunday today.
Tomorrow will be Monday.
The output should look like :
It is Monday today.
Tomorrow will be Tuesday.
So I want to do the following (pseudocode):
search for Sunday in current line
if found then
    saveline = curline
    read next line into curline
    search for Monday in curline
    if found then
        swap curline and saveline
        replace Sunday with Monday in curline
        swap curline and saveline again
        replace Monday with Tuesday in curline
        saveline = saveline + curline
        curline = saveline
    end innerif
end outerif
The next line is read with the "N" command, which appends it to the current line.
Swapping curline and saveline is provided by the "x" command.
Appending curline to saveline is provided by the "H" command.
Replacing curline with saveline is provided by the "g" command.
The overall sed script looks like this:
/Sunday/ {
# Save the current line in the hold buffer.
h
# Empty the pattern space (the line is already saved).
s/.*//
# Append the next line to the pattern space.
N
# Delete the leading newline that N inserted.
s/^.//
# Search for Monday in the next line.
    /Monday/ {
# Exchange the hold buffer and the current line.
        x
# The current line now contains the 1st line, so replace Sunday with Monday.
        s/Sunday/Monday/
# Exchange the hold buffer and the current line again.
        x
# The current line now contains the 2nd line, so replace Monday with Tuesday.
        s/Monday/Tuesday/
# Append the 2nd line to the hold buffer (which holds the 1st line).
        H
# Replace the current line with the hold buffer.
        g
    }
}
Inside Apache's httpd.conf, the equivalent sed script is written as follows:
OutputSed "/Sunday/ {"
OutputSed "h"
OutputSed "s/.*//"
OutputSed "N"
OutputSed "s/^.//"
OutputSed     "/Monday/ {"
OutputSed         "x"
OutputSed         "s/Sunday/Monday/"
OutputSed         "x"
OutputSed         "s/Monday/Tuesday/"
OutputSed         "H"
OutputSed         "g"
OutputSed     "}"
OutputSed "}"
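
The multi-line script can be tried from the shell before it goes into httpd.conf. The sketch below assumes a sed in which "." matches an embedded newline in the pattern space (GNU sed behaves this way); multiline_demo is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Same commands as the script above, passed as one multi-line -e argument.
multiline_demo() {
    sed -e '/Sunday/ {
h
s/.*//
N
s/^.//
/Monday/ {
x
s/Sunday/Monday/
x
s/Monday/Tuesday/
H
g
}
}'
}

# Prints "It is Monday today." and then "Tomorrow will be Tuesday."
printf 'It is Sunday today.\nTomorrow will be Monday.\n' | multiline_demo
```

One thing to watch for when adapting this: as written, if the second line does not contain "Monday", the first line stays in the hold buffer and is not printed.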
The examples above show how powerful sed commands can be used to filter web content (whether it is generated from HTML, PHP, or JSP). Details of sed can be found in the sed man page.
Comments:

Great, but how the heck do I get mod_sed? I haven't been able to find it for download anywhere, including your blog.

Posted by Benjamin Weiss on February 03, 2009 at 02:41 AM PST #

mod_sed is part of apache trunk.

If you are not using Apache from trunk, you can compile it for Apache 2.2; it works perfectly with Apache 2.2. Check out the trunk:

Take the following 4 files from trunk (modules/filters directory) :
mod_sed.c sed0.c sed1.c regexp.c

And compile for apache 2.2 using apxs
apxs -c mod_sed.c sed0.c sed1.c regexp.c

Posted by Basant Kukreja on February 03, 2009 at 02:59 AM PST #

Here is the url for these files :
http://svn.apache.org/repos/asf/httpd/httpd/trunk/modules/filters/

Posted by Basant Kukreja on February 03, 2009 at 03:01 AM PST #

I just got a few problems with that installation of mod_sed on my CPanel Server with Apache 2.2

apxs -c mod_sed.c sed0.c sed1.c regexp.c always gives me an error that the files are not found, so where do I have to copy the 4 files?

Hope you could help with that problem.

Posted by mike on February 27, 2009 at 05:11 PM PST #

I have already written above that these are part of httpd trunk.
You can get these files from :
http://svn.apache.org/repos/asf/httpd/httpd/trunk/modules/filters/

Posted by Basant Kukreja on February 28, 2009 at 09:24 AM PST #

I did the following, I installed the RPM from http://www.atomicorp.com/channels/atomic/centos/5/x86_64/RPMS/ and then tried to work with the apxs, but it always returned me that the files are not found.

I am relatively new to Linux, so it would be great if you could explain it in a bit more detail for me.

Posted by mike on February 28, 2009 at 06:47 PM PST #

All right, I solved the problem after recompiling my apache once more and start from the beginning.

Now I have one more final question. I tried to use regular expressions, but they are not accepted. So I used the following command with mod_substitute, where it worked:

Substitute 's|<body?(.*[^>])>|$0MY OWN CODE|g'

So I added my own code after any type of <body>, but with mod_sed nothing happens, have I missed something?

Posted by mike on February 28, 2009 at 10:55 PM PST #

Found one more question.

I just wanted to use the sed filter for all file types. So in mod_substitute the following works:

AddOutputFilterByType SUBSTITUTE text/html

So that command works for html and php and so on, but mod_sed is not accepting that. So for mod_sed, every file ending has to be added like that ?

AddOutputFilter Sed php php4 php5 html .......

Is there a way to use it for all processed files?

Posted by mike on March 01, 2009 at 01:07 AM PST #

AddOutputFilterByType Sed text/html

Posted by try this on March 01, 2009 at 10:02 PM PST #

Hi,

mod_sed seems to be exactly what I need, but I'm having a hard time getting it set up. Specifically, when I try to compile, I get a whole screenful of errors that looks like:

apxs -c mod_sed.c regexp.c sed0.c sed1.c
/usr/lib/apr-1/build/libtool --silent --mode=compile gcc -prefer-pic -O2 -g -march=i386 -mcpu=i686 -DLINUX=2 -D_REENTRANT -D_GNU_SOURCE -D_LARGEFILE64_SOURCE -pthread -I/usr/include/httpd -I/usr/include/apr-1 -I/usr/include/apr-1 -I/usr/include/mysql -c -o mod_sed.lo mod_sed.c && touch mod_sed.slo
`-mcpu=' is deprecated. Use `-mtune=' or '-march=' instead.
mod_sed.c:1: error: syntax error before '<' token
mod_sed.c:19:29: warning: character constant too long for its type
mod_sed.c:20:27: warning: character constant too long for its type
mod_sed.c:21:36: warning: character constant too long for its type
mod_sed.c:22:27: warning: character constant too long for its type
mod_sed.c:29: error: stray '#' in program
...
mod_sed.c:235: error: syntax error before '<' token
apxs:Error: Command failed with rc=65536

Any idea why this is happening? I'm running httpd 2.2.8 on Fedora Core 4. Thanks for your help!

-Dan Delany

Posted by Dan Delany on March 03, 2009 at 05:58 AM PST #

I've solved the problem I mentioned above, here's what was happening in case anyone else runs into it... I was getting my code from http://src.opensolaris.org/source/xref/webstack/mod_sed/ with wget (eg. wget http://src.opensolaris.org/source/xref/webstack/mod_sed/sed0.c). This does NOT work, as this is not actually a C file, but a generated HTML file...

Once I grabbed the files from http://svn.apache.org/repos/asf/httpd/httpd/trunk/modules/filters/ I was able to compile correctly. One caveat to the instructions above: In addition to the source files mentioned (mod_sed.c sed0.c sed1.c regexp.c), I also needed the header files to compile (libsed.h sed.h regexp.h). Once I got those files, the .so file was correctly created in the .libs directory. Thanks for a great module!

-Dan

Posted by Dan Delany on March 03, 2009 at 06:16 AM PST #

Very useful....
I am having some trouble replacing characters in an input filter by their HEX code , can anyone provide the syntax please?

G.

Posted by G. on April 04, 2009 at 12:35 AM PDT #

mod_sed is now integrated into opensolaris. Users can download mod_sed from :
http://src.opensolaris.org/source/xref/sfw/usr/src/cmd/apache2/modules/mod_sed.tar.gz

Posted by Basant on May 27, 2009 at 09:29 AM PDT #

You can also checkout the mod_sed from the bitbucket repository :
hg clone http://bitbucket.org/basantk/mod_sed/

Posted by Basant Kukreja on September 16, 2009 at 08:33 AM PDT #

Fixed mod_sed bug 48024
https://issues.apache.org/bugzilla/show_bug.cgi?id=48024#c3

(Also available from bitbucket repository)

Posted by Basant Kukreja on October 19, 2009 at 06:59 AM PDT #

1) There is a bug, but after checking the code you seem to be aware of it. If a stream does not end in a NEWLINE, one is added at the end. This is wrong for POSTs, and downloaded files are being modified (even if no patterns are found). This newline may (or may not) be a good idea on the command line, but as an Apache filter it is wrong.

Would commenting out line 335 in mod_sed.c fix this without side effects?
APR_BRIGADE_INSERT_TAIL(ctx->bb, b);

2) Question: Is there any way to quit processing the stream but keep passing it as-is? I use a conditional; once it is found, no more substitutions should be done. This is my code:

:filter
/my_condition/! {
s/xxx/yyy/g
N
b filter
}
:keep
N
b keep

The :keep loop is just "skip line-by-line till the end of file", but this looks quite inefficient.

A "q" instead of :keep quits altogether, and everything after my_condition is swallowed (instead of being bypassed).

Even more, if I could quit and keep everything that comes afterwards, then I could avoid the first loop and the multi-line handling altogether.

Any ideas?

Posted by SGA on February 06, 2011 at 06:27 PM PST #

Answer to Q1 : I agree something needs to be done to make it more suitable for HTTP processing.
I can't say that there won't be any side effect with your changes.

Answer to Q2 :
I am sure there are many alternative ways to do the above. Why not use
an if-else approach like this (not tested):

/my_condition/ ! {
s/xxx/yyy/g;
p
}
/my_condition/ {
p
}

Or simply

/my_condition/ ! {
s/xxx/yyy/g;
}
p

-----------------

Are branches really necessary?

Posted by Basant Kukreja on February 06, 2011 at 10:45 PM PST #

Regarding question 2:

One stream might need to be parsed until certain point only, and the following lines remain as they are (in my case they are binary content). Branching is the only way to "remember" what happened x lines ago. I would not need to remember anything if I could simply quit and keep all the remaining stream.

This coming code performs really fast, but I thought there might have been some way to bypass the loop:

OutputSed "/binary_content_starts/ b keep"
OutputSed "s/xxx/yyy/g"
OutputSed "b"
OutputSed ":keep"
OutputSed "N"
OutputSed "b keep"

Regarding question 1:
I think I have wrongly identified where the extra NEWLINE is added. Is that in your code or in the original sed code? I fail to find it.

Issue 3:
I have tried to use environment variables, but I have found they do not expand. I see no way to pass external information to mod_sed (a feature for the future). In my case that means loading 10000 regular expressions (I have the right one in the environment + header + cookie). Unfortunately I will have to find a way to tell sed to accept longer commands, since this is too big; I get this error around expression #500:

Failed to compile sed expression. too much command text: s/yyy/ooo/g

I tried a workaround with a long line, but this works only in sed and not in mod_sed

s/\(txt1\|txt2\)xxx/\1yyy/g

The problem is the OR (with or without backslash) I get no match. This however works as expected in mod_sed and sed:

s/\(txt1\)xxx/\1yyy/g

Posted by SGA on February 06, 2011 at 11:58 PM PST #

1) Found how to enlarge the regular expression length

in sed.h
#define RESIZE 10000

2) after investing way too many hours I still failed to find where the NEWLINE is appended at the end of the stream. I need to fix that. Any clue here?

Posted by SGA on February 08, 2011 at 12:13 AM PST #

Hi,

I have installed mod_sed on Apache 2.2.10 running on an OpenSuse 10.2 i386 platform. I was successfully able to install the module on Apache and its .so file was visible on the Apache server, but somehow it is not changing any contents of the POST request. I have used the following commands in the httpd.conf file
"<Directory "testproj/index.html">
AddInputFilter Sed html
InputSed "c/S/FA/g"
</Directory>

please suggest what could be the possible reason

Posted by utkarsh jain on February 08, 2011 at 02:33 PM PST #

AddInputFilter Sed html
InputSed "c/S/FA/g"

I think you meant :
InputSed "s/S/FA/g"

If the above is not the case, then just attach the debugger to httpd and make sure Apache is invoking the sed_request_filter function.
If it is not, something is wrong in your configuration.

Posted by Basant Kukreja on February 08, 2011 at 02:46 PM PST #

> Is that in your code or in the original sed code?
Logically I didn't change sed logic. It should behave identical to sed. Code is derived from opensolaris sed command.

Posted by Basant Kukreja on February 08, 2011 at 02:48 PM PST #

> I tried a workaround with a long line, but this works only in sed and not in mod_sed

>s/\(txt1\|txt2\)xxx/\1yyy/g

To make it clear, sed uses BRE (basic regular expression).

On Linux, sed is really gsed (GNU sed), which adds its own extensions. The original sed didn't have "|".

Here is a small experiment :

Linux (gsed) :

$ uname
Linux
[~] $ echo "txt1xxx" | sed -e 's/\(txt1\|txt2\)xxx/\1yyy/g'
txt1yyy

Solaris :
$ uname
SunOS
[~] $ echo "txt1xxx" | sed -e 's/\(txt1\|txt2\)xxx/\1yyy/g'
txt1xxx
[~] $ echo "txt1xxx" | /opt/csw/bin/gsed -e 's/\(txt1\|txt2\)xxx/\1yyy/g'
txt1yyy
-----------------------------------

Difference between BRE and RE and ERE :
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03
---------------------------------

BRE doesn't have OR operator support.

Posted by guest on February 08, 2011 at 03:01 PM PST #

> One stream might need to be parsed until certain point only, and the following lines remain as they are

Not sure if it will work, but try experimenting with the hold buffer to see if you can use it to save a variable's state.

Posted by Basant Kukreja on February 08, 2011 at 03:16 PM PST #

>after investing way too many hours I still failed to find where the NEWLINE

Debug the code in sed1.c (sed_finalize_eval). This function will be called for the last line.

If you cannot figure it out, I will debug it and let you know later (in a few days).

Posted by Basant Kukreja on February 08, 2011 at 03:18 PM PST #

> One stream might need to be parsed until certain point only, and the following lines remain as they are (in my case they are binary content). Branching is the only way to "remember" what happened x lines ago. I would not need to remember anything if I could simply quit and keep all the remaining stream.

The above is not true; you can use the hold buffer smartly to achieve what you want to do. Here is my experiment.

[/tmp] $ cat one.txt
abc
xyz
abc
abc
abc
[/tmp] $ cat one.sed
/abc/ {
x
/1/! {
# Replace everything with 1.
s/.*/1/
x
# The hold buffer now contains "1".
# The pattern space now contains the original line.
s/abc/cde/
}
/1/ {
x
}
}

[/tmp] $ cat /tmp/one.txt | sed -f one.sed
cde
xyz
abc
abc
abc
---------------------------

Note that sed only replaces abc with cde the first time; after that it leaves the rest alone.
This happened because I saved "1" in the hold buffer, which was used as a conditional variable.

>Found how to enlarge the regular expression length

I think you are going in the wrong direction. There must be better ways to achieve what you want to do.

Posted by Basant Kukreja on February 08, 2011 at 03:41 PM PST #

Can you please tell me how to attach a debugger to Apache? Can you please suggest a good open source debugger?

Posted by utkarsh jain on February 08, 2011 at 06:33 PM PST #

1) Good ideas! Will implement them and benchmark.

But there seems to be a bug for the x command:

OutputSed "x"
OutputSed "x"

The above code should work as a bypass, but it just replaces all lines with newlines. This is with an unaltered compilation of: mod_sed-983d603b3029.tar

I think the hold buffer is not getting the current line (h works, but of course I lose my buffer).

2) I think the problem is still beyond sed_finalize_eval, maybe by execute, but still not found, and with little hope now.

3) bigger RESIZE (my special case, default should be enough for most cases)

I need to alter the domain of one email address per request, from a whitelisted email addresses list.

I know the email address for each request (is in a cookie, and post headers). But I have no way to let mod_sed know (it cannot read environment), so the only solution is to load the full list (instead of an ideal %{PATTERN}), and match it against every single line. It still performs fast, but RESIZE was limiting my list size.

Thanks!

Posted by SGA on February 08, 2011 at 10:18 PM PST #

Comment to utkarsh :
http://httpd.apache.org/dev/debugging.html

Posted by Basant Kukreja on February 09, 2011 at 08:09 AM PST #

> But there seems to be a bug for the x command:

I will check.

Posted by Basant Kukreja on February 09, 2011 at 08:11 AM PST #

hi,
Actually, we have been able to modify the POST request contents, but we are not able to change the request length. Can you please guide us on how to do that?

Posted by utkarsh jain on March 16, 2011 at 10:06 PM PDT #

About

Basant Kukreja
