Recipe 1.11. Checking Whether a String Is Text or Binary
Credit: Andrew Dalke
Problem
Python can use a plain string to hold either
text or arbitrary bytes, and you need to determine (heuristically, of
course: there can be no precise algorithm for this) which of the two
cases holds for a certain string.
Solution
We can use the same heuristic criteria as Perl does, deeming a string
binary if it contains any nulls or if more than 30% of its characters
have the high bit set (i.e., codes greater than 127) or are strange
control codes. We have to code this ourselves, but this also means we
easily get to tweak the heuristics for special application needs:
from __future__ import division    # ensure / does NOT truncate
import string
text_characters = "".join(map(chr, range(32, 127))) + "\n\r\t\b"
_null_trans = string.maketrans("", "")
def istext(s, text_characters=text_characters, threshold=0.30):
    # if s contains any null, it's not text:
    if "\0" in s:
        return False
    # an "empty" string is "text" (arbitrary but reasonable choice):
    if not s:
        return True
    # Get the substring of s made up of non-text characters
    t = s.translate(_null_trans, text_characters)
    # s is 'text' if at most 30% of its characters are non-text ones:
    return len(t)/len(s) <= threshold
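The code above targets Python 2, where a plain string doubles as a byte string. As a minimal sketch, assuming Python 3 (where the data to classify would be a bytes object and string.maketrans no longer exists), the same heuristic might look like the following. The names istext_bytes and TEXT_BYTES are illustrative and not part of the original recipe:

```python
# Hypothetical Python 3 adaptation of the recipe's heuristic (not the
# original code): it classifies a bytes object instead of a str.
TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\b"

def istext_bytes(data, text_bytes=TEXT_BYTES, threshold=0.30):
    # any null byte means "binary"
    if b"\0" in data:
        return False
    # empty input counts as text, as in the recipe
    if not data:
        return True
    # bytes.translate(None, delete) drops every byte listed in delete,
    # so only the non-text bytes remain
    nontext = data.translate(None, text_bytes)
    # in Python 3, / is already true division
    return len(nontext) / len(data) <= threshold

# Tweaking the heuristic: also accept Italian accented letters,
# assuming the data is encoded as ISO-8859-1.
italian = TEXT_BYTES + "àèéìòù".encode("iso-8859-1")
sample = "è già".encode("iso-8859-1")   # 2 of its 5 bytes are non-ASCII
default_verdict = istext_bytes(sample)
tweaked_verdict = istext_bytes(sample, text_bytes=italian)
```

With the default character set, two of the five bytes in sample count as non-text (40%, above the threshold), so it is judged binary; once the accented letters are added to the accepted set, the same bytes are judged text.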
Discussion
You can easily do minor customizations to the heuristics used by
function istext by passing in specific values for
the threshold, which defaults to 0.30 (30%), or for
the string of those characters that are to be deemed
"text" (which defaults to normal
ASCII characters plus the four
"normal" control characters,
meaning ones that are often found in text). For example, if you
expected Italian text encoded as ISO-8859-1, you could add the
accented letters used in Italian,
"àèéìòù",
to the text_characters argument.

Often, what you need to check as being either binary or text is not a
string, but a file. Again, we can use the same heuristics as Perl,
checking just the first block of the file with the
istext function shown in this
recipe's Solution:
def istextfile(filename, blocksize=512, **kwds):
    return istext(open(filename).read(blocksize), **kwds)

Note that, by default, the expression len(t)/len(s) used in
the body of function istext would truncate the
result to 0, since it is a division between integer numbers. In some
future version (probably Python 3.0, a few years away), Python will
change the meaning of the / operator so that it
performs division without truncation; if you really do want
truncation, you should use the truncating-division operator,
//.

However, Python has not yet changed the semantics of division,
keeping the old one by default in order to ensure backwards
compatibility. It's important that the millions of
lines of code of Python programs and modules that already exist keep
running smoothly under all new 2.x versions of Python; only upon
a change of major language version number, no more often than every
decade or so, is Python allowed to change in ways that
aren't backwards-compatible.

Since, in the small module containing this recipe's
Solution, it's handy for us to get the division
behavior that is scheduled for introduction in some future release,
we start our module with the statement:
from __future__ import division

This statement
doesn't affect the rest of the program, only the
specific module that starts with this statement; throughout this
module, / performs "true
division" (without truncation). As of Python 2.3 and
2.4, division is the only thing you may want to
import from __future__. Other features that used
to be scheduled for the future, nested_scopes and
generators, are now part of the language and
cannot be turned off; it's innocuous to import
them, but it makes sense to do so only if your program also needs to
run under some older version of Python.
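To make the difference between the two division operators concrete, here is a small sketch. It assumes true-division semantics for /, that is, either Python 3 or a Python 2 module that begins with from __future__ import division; the variable names are illustrative only:

```python
# Assumes true-division semantics for / (Python 3, or Python 2 with
# "from __future__ import division" at the top of the module).
nontext, total = 3, 10

ratio = nontext / total     # true division: a float, usable as a fraction
floored = nontext // total  # truncating (floor) division: an int

print(ratio, floored)
```

Under these assumptions ratio is 0.3 and floored is 0. With truncating division, the ratio computed in istext would be 0 for any string containing at least one text character, so nearly every null-free string would be reported as text.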
See Also
Recipe 1.10 for more
details about function maketrans and string method
translate; Language
Reference for details about true versus truncating
division.