I have a set of HTML tables that store survey questions and responses over time. Each question has it's own HTML table, the columns are the years, the rows are the responses, then the individual cells have the number of responses for that year, as shown below:
I've gone back and forth on how to normalize this data and store it in a database, but I'm not sure what the best way is. I'm looking for a good database schema that can handle additional questions, responses, and years as time goes by. I'm also looking for a good query that can output an HTML table like below. I can do it easily enough in a PHP loop, but I'm worried that isn't good for performance.
Right now, I have the following table design:
question
id int(11) unsigned AI PK name varchar(255) UNQ number varchar(255) UNQ text
year
id int(11) unsigned AI PK question_id int(11) unsigned FK name varchar(255) UNQ (question_id + name)
response
id int(11) unsigned AI PK question_id int(11) unsigned FK name varchar(255) UNQ (question_id + name)
data
id int(11) unsigned AI PK question_id int(11) unsigned FK year_id int(11) unsigned FK response_id int(11) unsigned FK UNQ (year_id + response_id) count int(11) unsigned NULL
Any help or improvements would be greatly appreciated.
3 Answers
Answers 1
You don't need the table year
, since a year is question-independently.
And alter table data
year_id int(11) unsigned FK
toyear SMALLINT(4) UNSIGNED
UNQ (year_id + response_id)
toUNQ (year + response_id)
Answers 2
In general, if you have a UNIQUE
key (is that what 'UNQ' means??), use if for the PRIMARY KEY
.
"names" usually don't need to be VARCHAR(255)
. Pick a smaller size.
"numbers" usually don't need to be VARCHAR(255)
. Pick a more appropriate datatype.
Write you schema in CREATE TABLE
syntax; I am having severe trouble parsing your run-on description.
What does "0.00" represent? Is it derivable from the other data? If so, do not store it.
From the second images provide, I would guess you have 1 table:
CREATE TABLE foo ( year YEAR NULL, gender ENUM('male', 'female') NOT NULL, val SMALLINT UNSIGNED NOT NULL, PRIMARY KEY(year, gender) ) ENGINE=InnoDB;
I don't understand the meaning of '1959-1974', but it might be
cohort VARHAR(20) NOT NULL
and replace gender
in a second table that otherwise looks like the above table.
But... You can't really design a schema without understanding what will be done with the data. Do you have any tentative SELECTs
?
Answers 3
there's good ideas already for a structured version your target data model - if you wanted the structure of your stats to be a little more flexible, but still be able to key and group over time, then an alternative might be to follow bi/ dw pattern to model your data
the following are 'logical' and would correlate with the attributes/ dimensions in a fact table per kimball et. al., where the 'grain' of the fact table is 'src html file + table + row + cell + value(s)', assuming your values are consistent over time
(i notice in your image that there are a couple tables for the one html file, and a couple of values in each cell)
group_srcfile
(points to the location in the source html file/ table/ row/ cell, and you could probably store source html as well, in case you need to perform an autopsy later)group_cohort
(points to the normalised cohort eg. '18-24 yr olds over time', or 'males over time')group_question
(points to question definition - this is all the same questions over time)question_id
(question definition + question year)question_year
(this is the year that the question was asked)cohort_start_year
(this is the start-year of the cohort was asked the question)cohort_end_year
(this is the end-year of the cohort was asked the question)cohort_start_age
(if applicable, would be the normalised 'xxxx - yyyy', eg: '18')cohort_end_age
(this is either specified, or inferred by a 'xxxx - present' where 'present' is the year of the report html file)values 1 .. n must count the same things, otherwise you would need to split these off as well
to generate decent output you would need to finalise the questions on your data table, but whatever you do, it would be relatively straightforward to export html using php
i thought about the method by which you load the data into mysql, but without solid samples of the html files which are the source of your data, it's hard to write specific code (ie. open in your browser and 'view source', or equivalent)
as a general approach i would parse each fact (table cell td
) from the html using php and DOMDocument
, then emit a row in denormalised form, for subsequent loading into a staging table and ultimately your fact table
in this context, 'emit' is the source of what ultimately becomes an individual row in your fact table but you can't load it yet because you don't know what your dimension keys will be unless you define them at time of parsing the html
this is virtually impossible to do: instead, load into a loosely defined table (without any ref. integrity) and once you've finished parsing all files, write etl or queries that will generate your dimension tables before finishing up with your facts
(i would probably use pentaho data integration to handle the second phase - its streaming xml parser couldn't handle the first: too strict)
i found this test html file which was sufficiently aged, as to cause me to go and puke my last coffee at the mere thought of endlessly re-writing scraping code to account for the never-ending layout changes courtesy of 'dreamtheaver' ...
once my hands had stabilised enough, and blood-flow had returned to normal, i channeled the machine spirit and produced the following php - notably absent is any sort of restructuring/ denormalising of the source table:
<?php ini_set('display_errors', 1); ini_set('display_startup_errors', 1); error_reporting(E_ALL); $dom = new DOMDocument(); $srcfile = 'testxmlparser.html'; $dom->loadHTMLFile( $srcfile, NULL ); echo 'odom is: ' . ($dom ? 'nice':'naughty') . PHP_EOL; if( $dom ) { // get all the table rows in the document $tblrows = $dom->getElementsByTagName('tr'); foreach( $tblrows as $trrow ) { $tblcells = $trrow->getElementsByTagName('td'); $incr = 0; // buffer this table row's cell (td) data that we encounter, in case it is interesting... $srowbuf = ''; foreach( $tblcells as $tdcell ) { $srowbuf = ($srowbuf . $tdcell->nodeValue); if( 1 <= $incr++ ) $srowbuf = ($srowbuf . '+|'); } // we know the table data we're interested in has 12 cells only if( 12 == $incr ) echo $srowbuf . '+|' . $incr . '+|' . $srcfile . PHP_EOL; } } ?>
0 comments:
Post a Comment