Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing serialization corrupts some data #20

Open
tcavalli opened this issue Nov 1, 2013 · 7 comments
Open

Fixing serialization corrupts some data #20

tcavalli opened this issue Nov 1, 2013 · 7 comments

Comments

@tcavalli
Copy link
Contributor

tcavalli commented Nov 1, 2013

After lib/tasks.php replaces URLs in the database dump, the next line of code uses a regular expression in order to fix up the length of serialized strings. I contributed to this line of code, but only to fix deprecated code so I knew where to look the other day when I noticed corrupted serialized data.

The problem is that plugins (at least) store data in wordpress's options table (and possibly elsewhere) using, of course, add_option or update_option, which, as documented, serialize the option value. So with no a priori knowledge of the plugins that will be used nor the data they might try to store, a problem emerges.

Consider the abbreviated INSERT statement below from a database dump:

INSERT INTO `wp_options` VALUES (190,'iphorm_active_uniform_themes','s:78:\"a:4:{i:1;s:7:\"default\";i:2;s:7:\"default\";i:3;s:7:\"default\";i:4;s:7:\"default\";}\";','yes');

This plugin must've done something equivalent to:

// Probably not explicitly set like this...
$data = array('default', 'default', 'default', 'default');

$active_uniform_themes = serialize($data);
update_option('iphorm_active_uniform_themes', $active_uniform_themes);
// or even just update_options('iphorm_active_uniform_themes', serialize($data));

Granted, if the case is anything close to as simple as above, the plugin's code should really be fixed. Wordpress's documentation clearly says:

You do not need to serialize values. If the value needs to be
serialized, then it will be serialized before it is inserted into the
database.

But it's not hard to imagine that a plugin legitimately needs to store some object or array (of objects) which themselves may have serialized data (maybe not in wordpress's option table, but it doesn't matter where as long as it's in the same database).

Anyway, the regular expression matches the following portion of the SQL statement:

s:78:\"a:4:{i:1;s:7:\"default\";

And happily changes it to:

s:21:"a:4{i:1;s:7:\"default";

And the final query ends up as:

INSERT INTO `wp_options` VALUES (190,'iphorm_active_uniform_themes','s:21:"a:4:{i:1;s:7:\"default";i:2;s:7:"default";i:3;s:7:"default";i:4;s:7:"default";}\";','yes');

Wordpress would be able to unserialize this without error when retrieving the option, but the result would be truncated as you might expect:

string(21) "a:4:{i:1;s:7:"default"

Then, should the plugin try to unserialize this back into the original array, PHP will issue an E_NOTICE and return false. This could lead to undefined behavior, but I would expect a lot of code to revert to default settings (whether it's intentional or just a happy accident). The unfortunate part of this example is that this serialization never needed to be "fixed"!

I haven't verified, but this may be the cause of an issue we were experiencing where a page's header image and navigation menu wordpress options had been reverted to default after merging data from production, which had these values customizerd.

As for a resolution, I would suggest a more comprehensive approach than regular expressions, or at least some solution that involves more than a single regular expression. The potential length of data and the potential complexity of serialized information in that data can quickly achieve enormously poor performance.

@tamagokun
Copy link
Owner

Ah, I see. Serializing serialized data. Bad plugin design indeed, but I think the regex could be updated to make sure it is stopping at the outer most }; rather than the first one it sees (which is what is happening in this case).

I think a better approach would be to only modify serialized data chunks that were modified rather than just running through all of them. The only case they would be modified is if they contained the URL to the site and it needed to be updated from one environment to another.

@tcavalli
Copy link
Contributor Author

tcavalli commented Nov 1, 2013

Well, here're my two cents:

Presumably you want this project to work regardless of what a plugin developer does. To me, that means that even something as ridiculous as the following shouldn't cause problems:

for($surl=get_site_url(),$i=0;$i<10 && $surl=serialize($surl);++$i);
update_option('super_serialized_site_url', $surl);

After the for loop, $surl contains something like the following:

s:82:"s:74:"s:66:"s:58:"s:50:"s:42:"s:34:"s:26:"s:18:"s:10:"mysite.com";";";";";";";";";";

Which would cause problems if you're looking for curly braces.

But even with curly brackets, how do you know when you've reached the "outermost" closing brace?

$options = array(
    'foo' => array('bar'=>'baz'),
    $siteurl => array(
        array(array(serialize($siteurl))),
        array(array(array(1,2,3)))
    ));
update_option('ridiculous_option', serialize($options));

Now we're asking to store this value as an option:

a:2:{s:3:"foo";a:1:{s:3:"bar";s:3:"baz";}s:10:"mysite.com";a:2:{i:0;a:1:{i:0;a:1:{i:0;s:18:"s:10:"mysite.com";";}}i:1;a:1:{i:0;a:1:{i:0;a:3:{i:0;i:1;i:1;i:2;i:2;i:3;}}}}}

And if you take a look at wordpress's maybe_serialize function (which is called by add_option and update_option), this value would be serialized once more before being inserted into the database!

To complicate matters even more, remember that this is being parsed in the middle of a database dump which you'd expect to contain many more data serializations. If your regular expression misses the closing bracket (or expects one that isn't there), obviously you won't get the result you expect, but even if you detect this error in code or even make the regular expression stop upon detecting an unescaped single quote and your code ignores it (perhaps to try something else), that's still atrocious in terms of computational complexity/efficiency.

Finding and only fixing serialized data that has been modified is the way I'd go, but you still run into the same problems if you try to use a regular expression for the job. As I mentioned just above, the last example (well, both examples, actually) gets serialized once more before being stored in the database, so in a dump of that database, you'd be dealing with:

s:170:"a:2:{s:3:"foo";a:1:{s:3:"bar";s:3:"baz";}s:10:"mysite.com";a:2:{i:0;a:1:{i:0;a:1:{i:0;s:18:"s:10:"mysite.com";";}}i:1;a:1:{i:0;a:1:{i:0;a:3:{i:0;i:1;i:1;i:2;i:2;i:3;}}}}}";

Changing two strings in here will possibly require changing 3 numbers (unless the modified string happens to be the same length). Using regular expressions for this task seem like a complete departure from their purpose of defining a formal language... but if you can think of a clever way to do it that's efficient and everything, more power to you!

@tamagokun
Copy link
Owner

So, the implementation is going to involve some logic, as this complex of an issue cannot be solved via regex alone. That being said, I want to keep it as lightweight as possible.

If there isn't a solid way to sift through all the serialized data, it might make more sense just to take this out altogether and leave it up to developers to search and replace as they please. Not putting URLs in content would be the easiest way to not even need this feature.

@tcavalli
Copy link
Contributor Author

tcavalli commented Nov 4, 2013

Yes, it will inevitably involve some logic, and I'm not just complaining here, I have been keeping this at the back of my mind, thinking of potential solutions. I have a few ideas but I'll need some time to play around and see how involved they get (or if they'll even work for such extreme edge cases). Hopefully I can come up with something brilliant :-)

This project seems like a good place for conversion to happen when moving data between environments, so I, personally, would like to see a lightweight solution here, too. I just wanted to illustrate the problem as best I could so that if you have time to think it over, maybe you'll beat me to a brilliant solution.

As an aside, as far as lightweight is concerned, you could already use PHP's file_get_contents function to take advantage of memory mapping functionality in the OS (if available), and, as the PHP documentation for str_replace notes: if you don't need any fancy replacing rules as with regular expressions, then you should just use the basic str_replace or stri_replace functions (as you can imagine, their implementations are going to be much more lightweight than one that has to implement or at least invoke a regex engine). So perhaps if we can figure out an elegant solution to this bug it'll come with other optimizations as well.

@tamagokun
Copy link
Owner

Sweet. Let me know if you come up with anything. I just finished some features and improvements of Pomander, and will be doing some work on the Wordpress plugin for Pomander very soon. I'll also take a look at this problem closer as I get some free time.

@8ig8
Copy link

8ig8 commented Apr 30, 2014

@tamagokun
Copy link
Owner

@8ig8 thanks for that. I think I may have come across that the first time I tried to implement the serialization fixes, but will take a closer look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants