Perl Best Practices [Electronic resources] (text version)


8.4. Fixed-Width Data

Use unpack to extract fixed-width fields.

Fixed-width text data:

X123-S000001324700000199
SFG-AT000000010200009099
Y811-Q000010030000000033

is still widely used in many data processing applications. The obvious way to extract this kind of data is with Perl's built-in substr function. But the resulting code is unwieldy and surprisingly slow:



# Specify field locations...
Readonly my %FIELD_POS => (ident=>0,  sales=>6,   price=>16);
Readonly my %FIELD_LEN => (ident=>6,  sales=>10,  price=>8);

# Grab each line/record...
while (my $record = <$sales_data>) {
    # Extract each field...
    my $ident = substr($record, $FIELD_POS{ident}, $FIELD_LEN{ident});
    my $sales = substr($record, $FIELD_POS{sales}, $FIELD_LEN{sales});
    my $price = substr($record, $FIELD_POS{price}, $FIELD_LEN{price});

    # Append each record, translating ID codes and
    # normalizing sales (which are stored in 1000s)...
    push @sales, {
        ident => translate_ID($ident),
        sales => $sales * 1000,
        price => $price,
    };
}

Using regexes to capture the various fields produces slightly cleaner code, but the matches are still not optimally fast:



# Specify order and lengths of fields...
Readonly my $RECORD_LAYOUT
    => qr/\A (.{6}) (.{10}) (.{8}) /xms;

# Grab each line/record...
while (my $record = <$sales_data>) {
    # Extract all fields...
    my ($ident, $sales, $price)
        = $record =~ m/ $RECORD_LAYOUT /xms;

    # Append each record, translating ID codes and
    # normalizing sales (which are stored in 1000s)...
    push @sales, {
        ident => translate_ID($ident),
        sales => $sales * 1000,
        price => $price,
    };
}

The built-in unpack function is optimized for this kind of task. In particular, a series of 'A' specifiers can be used to extract a sequence of multicharacter substrings:



# Specify order and lengths of fields...
Readonly my $RECORD_LAYOUT => 'A6 A10 A8';  # 6 ASCII, then 10 ASCII, then 8 ASCII

# Grab each line/record...
while (my $record = <$sales_data>) {
    # Extract all fields...
    my ($ident, $sales, $price)
        = unpack $RECORD_LAYOUT, $record;

    # Append each record, translating ID codes and
    # normalizing sales (which are stored in 1000s)...
    push @sales, {
        ident => translate_ID($ident),
        sales => $sales * 1000,
        price => $price,
    };
}
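The speed claims above are easy to check on your own machine. The following is a minimal sketch, not a definitive benchmark: it wraps the three extraction techniques in hypothetical helper subroutines (via_substr, via_regex, via_unpack), confirms that they return the same fields for one sample record, and then compares them with the core Benchmark module. Plain lexicals stand in for the Readonly declarations so that the script needs no non-core modules, and the relative timings will vary with the Perl version and platform.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# One record from the unspaced sample data, with the newline a file read would leave
my $record = "X123-S000001324700000199\n";

# Plain lexicals instead of Readonly (an assumption, to avoid a non-core dependency)
my %FIELD_POS     = (ident => 0, sales => 6,  price => 16);
my %FIELD_LEN     = (ident => 6, sales => 10, price => 8);
my $LAYOUT_REGEX  = qr/\A (.{6}) (.{10}) (.{8}) /xms;
my $LAYOUT_UNPACK = 'A6 A10 A8';

# Hypothetical helpers: each returns the list (ident, sales, price)
sub via_substr {
    my ($rec) = @_;
    my $ident = substr($rec, $FIELD_POS{ident}, $FIELD_LEN{ident});
    my $sales = substr($rec, $FIELD_POS{sales}, $FIELD_LEN{sales});
    my $price = substr($rec, $FIELD_POS{price}, $FIELD_LEN{price});
    return ($ident, $sales, $price);
}

sub via_regex {
    my ($rec) = @_;
    my ($ident, $sales, $price) = $rec =~ m/ $LAYOUT_REGEX /xms;
    return ($ident, $sales, $price);
}

sub via_unpack {
    my ($rec) = @_;
    my ($ident, $sales, $price) = unpack $LAYOUT_UNPACK, $rec;
    return ($ident, $sales, $price);
}

# Sanity check: all three techniques must agree before timing them
my @expected = ('X123-S', '0000013247', '00000199');
for my $extract (\&via_substr, \&via_regex, \&via_unpack) {
    my @fields = $extract->($record);
    die "Unexpected fields: @fields\n" if "@fields" ne "@expected";
}

# Run each technique for at least one CPU second and print a comparison table
cmpthese(-1, {
    substr => sub { my @fields = via_substr($record) },
    regex  => sub { my @fields = via_regex($record)  },
    unpack => sub { my @fields = via_unpack($record) },
});
```

The sanity check matters as much as the timing: a faster extraction is only worth adopting if it yields exactly the same fields as the code it replaces.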

Some fixed-width formats insert one or more empty columns between the fields of each record, to make the resulting data more readable to humans. For example:


X123-S  0000013247  00000199
SFG-AT  0000000102  00009099
Y811-Q  0000100300  00000033

When extracting fields from such data, you should use the '@' specifier to tell unpack where each field starts. For example:



# Specify order and lengths of fields...
Readonly my $RECORD_LAYOUT
    => '@0 A6 @8 A10 @20 A8';  # At column zero extract 6 ASCII chars
                               # then at column 8 extract 10,
                               # then at column 20 extract 8.

# Grab each line/record...
while (my $record = <$sales_data>) {
    # Extract all fields...
    my ($ident, $sales, $price)
        = unpack $RECORD_LAYOUT, $record;

    # Append each record, translating ID codes and
    # normalizing sales (which are stored in 1000s)...
    push @sales, {
        ident => translate_ID($ident),
        sales => $sales * 1000,
        price => $price,
    };
}
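To see the '@' specifiers at work without a data file, here is a small self-contained sketch that applies the same layout to the three spaced sample records. It makes two simplifying assumptions: the records live in an in-memory array rather than behind the $sales_data filehandle, and the ID is stored untranslated because translate_ID is not defined in this section. It also unpacks the first record with the unspaced layout to show how the fields drift out of alignment when the padding columns are ignored.

```perl
use strict;
use warnings;

# Same layout as above; a plain lexical stands in for the Readonly declaration
my $RECORD_LAYOUT = '@0 A6 @8 A10 @20 A8';

# The spaced sample records, held in memory instead of being read from a file
my @records = (
    "X123-S  0000013247  00000199\n",
    "SFG-AT  0000000102  00009099\n",
    "Y811-Q  0000100300  00000033\n",
);

my @sales;
for my $record (@records) {
    # Jump to each field's starting column, then extract it
    my ($ident, $sales, $price) = unpack $RECORD_LAYOUT, $record;

    # Normalize sales (stored in 1000s); the ID is kept as-is in this sketch
    push @sales, {
        ident => $ident,
        sales => $sales * 1000,
        price => $price,
    };
}

# Contrast: the unspaced layout applied to spaced data yields misaligned fields
my @misaligned = unpack 'A6 A10 A8', $records[0];

printf "%-6s %12d %s\n", @{$_}{qw(ident sales price)} for @sales;
print  "Misaligned without '\@': ", join('|', @misaligned), "\n";
```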

This approach scales extremely well, and can also cope with non-spaced data or variant layouts (i.e., with reordered fields). In particular, the unpack function doesn't require '@' specifiers to appear in increasing column order. This means that an unpack can roam back and forth through a string (much like seek-ing a filehandle) and thereby extract fields in any convenient order. For example:



# Specify order and lengths of fields...
Readonly my %RECORD_LAYOUT => (
    #            Ident   Sales   Price
    Unspaced => '    A6     A10      A8',   # Legacy layout
    Spaced   => ' @0 A6  @8 A10  @20 A8',   # Standard layout
    ID_last  => '@21 A6  @0 A10  @12 A8',   # New, more convenient layout
);

# Select record layout...
my $layout_name = get_layout($filename);

# Grab each line/record...
while (my $record = <$sales_data>) {
    # Extract all fields...
    my ($ident, $sales, $price)
        = unpack $RECORD_LAYOUT{$layout_name}, $record;

    # Append each record, translating ID codes and
    # normalizing sales (which are stored in 1000s)...
    push @sales, {
        ident => translate_ID($ident),
        sales => $sales * 1000,
        price => $price,
    };
}

The loop body is very similar to those in the earlier examples, except for the record layout now being looked up in a hash. The three variations in formatting and sequence have been cleanly factored out into a table.

Note that the entry for $RECORD_LAYOUT{ID_last}:


ID_last => '@21 A6  @0 A10  @12 A8',

makes use of non-monotonic '@' specifiers. By jumping to column 21 first, then back to column 0, and on again to column 12, this ID_last format ensures that the call to unpack within the loop:


my ($ident, $sales, $price)
    = unpack $RECORD_LAYOUT{$layout_name}, $record;

will extract the record ID before the sales amount and the price, even though the ID field comes after those other two fields in the file.
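The effect of the three layouts can be checked side by side. The sketch below is illustrative only: the ID_last sample record is hypothetical, constructed to match the column positions in that layout (sales at column 0, price at column 12, ID at column 21), and plain lexical hashes stand in for the Readonly table. Each layout is applied to a record in its own format, and all three should produce the fields in the same ident, sales, price order.

```perl
use strict;
use warnings;

# The three layouts from the table above
my %RECORD_LAYOUT = (
    Unspaced => '    A6     A10      A8',
    Spaced   => ' @0 A6  @8 A10  @20 A8',
    ID_last  => '@21 A6  @0 A10  @12 A8',
);

# One record per layout; the ID_last record is a made-up example
# built to fit the columns that layout names
my %SAMPLE_RECORD = (
    Unspaced => 'X123-S000001324700000199',
    Spaced   => 'X123-S  0000013247  00000199',
    ID_last  => '0000013247  00000199 X123-S',
);

my %fields_for;
for my $layout_name (sort keys %RECORD_LAYOUT) {
    # The same unpack call works for every layout, including the
    # non-monotonic one that jumps to column 21 and then back to 0
    my ($ident, $sales, $price)
        = unpack $RECORD_LAYOUT{$layout_name}, $SAMPLE_RECORD{$layout_name};

    $fields_for{$layout_name} = [$ident, $sales, $price];
    print "$layout_name: ident=$ident sales=$sales price=$price\n";
}
```

Because the field order is fixed by the template rather than by the record, the code that consumes ($ident, $sales, $price) never needs to know which layout a file uses.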


Damian Conway
