Perl Best Practices [Electronic resources] (text version)


Damian Conway



8.4. Fixed-Width Data


Use unpack to extract fixed-width fields.


Fixed-width text data:


X123-S000001324700000199
SFG-AT000000010200009099
Y811-Q000010030000000033

is still widely used in many data processing applications. The obvious way to extract this kind of data is with Perl's built-in substr function. But the resulting code is unwieldy and surprisingly slow:



# Specify field locations...
Readonly my %FIELD_POS => (ident=>0, sales=>6,  price=>16);
Readonly my %FIELD_LEN => (ident=>6, sales=>10, price=>8);

# Grab each line/record...
while (my $record = <$sales_data>) {
    # Extract each field...
    my $ident = substr($record, $FIELD_POS{ident}, $FIELD_LEN{ident});
    my $sales = substr($record, $FIELD_POS{sales}, $FIELD_LEN{sales});
    my $price = substr($record, $FIELD_POS{price}, $FIELD_LEN{price});

    # Append each record, translating ID codes and
    # normalizing sales (which are stored in 1000s)...
    push @sales, {
        ident => translate_ID($ident),
        sales => $sales * 1000,
        price => $price,
    };
}

Using regexes to capture the various fields produces slightly cleaner code, but the matches are still not optimally fast:



# Specify order and lengths of fields...
Readonly my $RECORD_LAYOUT
    => qr/\A (.{6}) (.{10}) (.{8}) /xms;

# Grab each line/record...
while (my $record = <$sales_data>) {
    # Extract all fields...
    my ($ident, $sales, $price)
        = $record =~ m/ $RECORD_LAYOUT /xms;

    # Append each record, translating ID codes and
    # normalizing sales (which are stored in 1000s)...
    push @sales, {
        ident => translate_ID($ident),
        sales => $sales * 1000,
        price => $price,
    };
}

The built-in unpack function is optimized for this kind of task. In particular, a series of 'A' specifiers can be used to extract a sequence of multicharacter substrings:



# Specify order and lengths of fields...
Readonly my $RECORD_LAYOUT => 'A6 A10 A8';   # 6 ASCII, then 10 ASCII, then 8 ASCII

# Grab each line/record...
while (my $record = <$sales_data>) {
    # Extract all fields...
    my ($ident, $sales, $price)
        = unpack $RECORD_LAYOUT, $record;

    # Append each record, translating ID codes and
    # normalizing sales (which are stored in 1000s)...
    push @sales, {
        ident => translate_ID($ident),
        sales => $sales * 1000,
        price => $price,
    };
}
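
The three approaches can be compared directly. The following is a minimal sketch, not a definitive benchmark: it uses one sample record from the data above, plain lexical variables instead of Readonly (to avoid a non-core dependency), and the core Benchmark module. The %extract and %result hashes are names invented for this illustration. Timings depend on the perl version and the hardware, so no particular ranking is shown here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# One sample record from the unspaced data above
my $record = 'X123-S000001324700000199';

# Three extractors (hypothetical names); each returns (ident, sales, price)
my %extract = (
    'substr' => sub {
        return ( substr($record,  0,  6),
                 substr($record,  6, 10),
                 substr($record, 16,  8) );
    },
    'regex' => sub {
        return $record =~ m/\A (.{6}) (.{10}) (.{8}) /xms;
    },
    'unpack' => sub {
        return unpack 'A6 A10 A8', $record;
    },
);

# Confirm that all three approaches agree before timing them
my %result = map { $_ => join('|', $extract{$_}->()) } keys %extract;
print "$_: $result{$_}\n" for sort keys %result;

# Run each extractor for at least one CPU second and compare the rates
cmpthese(-1, \%extract);
```

Checking that the extractors return identical fields first guards against timing code that is fast only because it is wrong.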

Some fixed-width formats insert one or more empty columns between the fields of each record, to make the resulting data more readable to humans. For example:


X123-S  0000013247  00000199
SFG-AT  0000000102  00009099
Y811-Q  0000100300  00000033

When extracting fields from such data, you should use the '@' specifier to tell unpack where each field starts. For example:



# Specify order and lengths of fields...
Readonly my $RECORD_LAYOUT
    => '@0 A6 @8 A10 @20 A8';   # At column zero extract 6 ASCII chars
                                # then at column 8 extract 10,
                                # then at column 20 extract 8.

# Grab each line/record...
while (my $record = <$sales_data>) {
    # Extract all fields...
    my ($ident, $sales, $price)
        = unpack $RECORD_LAYOUT, $record;

    # Append each record, translating ID codes and
    # normalizing sales (which are stored in 1000s)...
    push @sales, {
        ident => translate_ID($ident),
        sales => $sales * 1000,
        price => $price,
    };
}
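
As a self-contained illustration of the '@' specifier, here is a small sketch that can be run as is. It assumes two blank columns between fields, which is what the column numbers in the layout '@0 A6 @8 A10 @20 A8' imply. It also uses plain lexicals instead of Readonly and omits translate_ID(), since that subroutine is not defined in this section. The @records and @rows arrays are names invented for the example.

```perl
use strict;
use warnings;

# Spaced sample records, assuming two blank columns between fields
my @records = (
    'X123-S  0000013247  00000199',
    'SFG-AT  0000000102  00009099',
    'Y811-Q  0000100300  00000033',
);

# Ident starts at column 0, sales at column 8, price at column 20
my $layout = '@0 A6 @8 A10 @20 A8';

my @rows;
for my $record (@records) {
    my ($ident, $sales, $price) = unpack $layout, $record;

    # Normalize sales (stored in 1000s) and numify the zero-padded price
    push @rows, {
        ident => $ident,
        sales => $sales * 1000,
        price => $price + 0,
    };
}

printf "%s %d %d\n", @{$_}{qw(ident sales price)} for @rows;
```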

This approach scales extremely well, and can also cope with non-spaced data or variant layouts (i.e., with reordered fields). In particular, the unpack function doesn't require that '@' specifiers appear in increasing column order. This means that an unpack can roam back and forth through a string (much like seek-ing a filehandle) and thereby extract fields in any convenient order. For example:



# Specify order and lengths of fields...
Readonly my %RECORD_LAYOUT => (
    #            Ident    Sales    Price
    Unspaced => '    A6      A10       A8',    # Legacy layout
    Spaced   => ' @0 A6   @8 A10   @20 A8',    # Standard layout
    ID_last  => '@21 A6   @0 A10   @12 A8',    # New, more convenient layout
);

# Select record layout...
my $layout_name = get_layout($filename);

# Grab each line/record...
while (my $record = <$sales_data>) {
    # Extract all fields...
    my ($ident, $sales, $price)
        = unpack $RECORD_LAYOUT{$layout_name}, $record;

    # Append each record, translating ID codes and
    # normalizing sales (which are stored in 1000s)...
    push @sales, {
        ident => translate_ID($ident),
        sales => $sales * 1000,
        price => $price,
    };
}

The loop body is very similar to those in the earlier examples, except for the record layout now being looked up in a hash. The three variations in formatting and sequence have been cleanly factored out into a table.

Note that the entry for $RECORD_LAYOUT{ID_last}:


    ID_last => '@21 A6 @0 A10 @12 A8',

makes use of non-monotonic '@' specifiers. By jumping to column 21 first, then back to column 0, and on again to column 12, this ID_last format ensures that the call to unpack within the loop:


    my ($ident, $sales, $price)
        = unpack $RECORD_LAYOUT{$layout_name}, $record;

will extract the record ID before the sales amount and the price, even though the ID field comes after those other two fields in the file.
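
To see the table-driven approach at work, the sketch below unpacks one record per layout and shows that each yields the fields in the same order. The %sample records are hypothetical: they were constructed to match the column positions in each layout string and are not taken from real data. Plain lexicals stand in for Readonly so that the example needs only core Perl.

```perl
use strict;
use warnings;

# Layout table, as in the example above
my %RECORD_LAYOUT = (
    Unspaced => 'A6 A10 A8',
    Spaced   => '@0 A6 @8 A10 @20 A8',
    ID_last  => '@21 A6 @0 A10 @12 A8',
);

# Hypothetical sample record for each layout (same data, different arrangement)
my %sample = (
    Unspaced => 'X123-S000001324700000199',
    Spaced   => 'X123-S  0000013247  00000199',
    ID_last  => '0000013247  00000199 X123-S',
);

# Unpack each sample with its own layout; the field order is always
# (ident, sales, price), regardless of where the fields sit in the record
my %fields_for;
for my $name (sort keys %RECORD_LAYOUT) {
    my @fields = unpack $RECORD_LAYOUT{$name}, $sample{$name};
    $fields_for{$name} = join ',', @fields;
    print "$name: $fields_for{$name}\n";
}
```

For the ID_last record, unpack jumps to column 21 for the ID, returns to column 0 for the sales figure, and then moves to column 12 for the price, so the calling code never needs to know which layout was used.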
